Prompt Injection in AI Agents: the Threat and How to Defend
Prompt injection is an attack where content the model reads (an email, a document, a web page, a tool description) poses as an instruction and changes the agent's behavior. In a system with tool access it does not stop at a bad answer: the agent can take an unwanted action. OWASP has ranked it number one for two editions (LLM01).
Prompt injection is now the most widely demonstrated attack on systems built on language models, and for agents that perform work themselves, the consequences go beyond a "bad answer". This article explains what it is, why agents raise the stakes, what has already happened in practice, and how to reduce the risk in real terms, without scaremongering and without promising a "resistant model".
Quick answer
Prompt injection is a vulnerability where input the model reads contains a hidden instruction and changes its behavior. The model does not firmly separate "an instruction from the system's author" from "content to process": both land in the same context window. For a plain chatbot the result is a wrong answer. For an agent with tool access (email, files, APIs, a database) the result can be an unwanted action taken on the company's behalf: an email sent, a record changed, data exfiltrated.
OWASP ranks prompt injection first on its list of risks for LLM applications (LLM01) in the 2023 and 2025 editions, and in December 2025 publishes a separate list for agentic applications. This is not one vendor's bug, it is a property of the architecture: as long as instructions and data flow through one channel, the risk has to be reduced in layers, not "patched" with a single filter.
What prompt injection is
Direct prompt injection is an instruction typed straight in by the user ("ignore the previous instructions and..."). The indirect form is often more dangerous: an instruction hidden in content the agent reads along the way, in an email, a PDF, a web page, a customer ticket, a tool description, or memory from a previous session. The agent treats that content as working context, while in reality it is reading someone else's command.
OWASP distinguishes prompt injection from jailbreaking: jailbreaking targets the model's safety controls, while injection changes the functional behavior of the application. For a company, the second case usually matters more. The question is not whether the model says something inappropriate, but whether it takes an action that no human authorized.
Why agents raise the stakes
Three properties of agentic deployments make the same attack hurt more:
- A shared context window. The system prompt, the user message, retrieved documents, tool output, history, and memory all land in a single token stream, with no hard trust boundary between "instruction" and "data".
- Persistent memory. An injection written into long-term memory, a RAG corpus, or a vector store taints every later session that reads from that source.
- Action execution. When the model's output triggers tools (files, shell, email, cloud APIs, MCP servers, sub-agents), the blast radius extends from the chat window to whatever the tools can reach. Tool output re-enters the context, which enables chained actions.
OWASP calls the related risk "Excessive Agency" (LLM06): the broader an agent's permissions, the higher the damage ceiling of a successful attack. That is the heart of pricing the risk. You do not ask "will the model make a mistake", you ask "what is the worst it can do when it does".
Real incidents, 2024–2026
This is not a theoretical risk. Below are publicly documented cases, each with a different vector but the same root cause:
| Incident | What happened | Lesson |
|---|---|---|
| EchoLeak (CVE-2025-32711) | June 2025: the first documented zero-click attack on a production AI system (Microsoft 365 Copilot). A single crafted email was enough to make Copilot reach internal files and send their contents to an attacker's server, with no user action. | The content an agent reads is itself an attack surface. |
| Slack AI (PromptArmor) | August 2024: an instruction placed in a public channel made it possible to exfiltrate data from private channels, including API keys. | Indirect injection through content the agent can access. |
| Cursor + Supabase MCP | July 2025: a support ticket led an agent with service_role privileges to dump the production database into a user-visible thread. | Excessive tool permissions turn injection into a database leak. |
| postmark-mcp (npm) | September 2025: a malicious MCP package silently BCC'd copies of emails to an attacker for about 8 days. | The tool supply chain is an attack channel too. |
| Amazon Q (AWS-2025-015) | July 2025: a commit instructing deletion of files and AWS resources reached the extension's repository; the version shipped to roughly one million installs. | An agent's configuration and instructions need code-grade control. |
| TrapDoor | 2026: hidden instructions encoded in zero-width characters inside .cursorrules and CLAUDE.md files tried to make an assistant run a "security scan" that exfiltrated secrets. The file looked empty. | Content invisible to a human can be visible to the model. |
Research on attack success rates shows that with enough attempts, injection gets through even top models (on the order of 89% for GPT-4o and 78% for Claude 3.5 Sonnet in one test), and current defenses tend to slow the attack rather than eliminate it. Poisoning a RAG system's answers can take just a handful of crafted documents.
How to defend against it
There is no single filter that "switches off" prompt injection. What works is layered defense, where each layer takes some reach away from the attack. That is how we read the seven criteria of an agent: not as a feature list, but as the places where damage is contained.
- Separate the privileged model from the content. In the dual-LLM pattern (described by Simon Willison) the model that holds the tools never reads untrusted content directly, while a "quarantined" model reads the content but cannot act. This cuts the path an instruction would have to travel to reach an action. (Tools, Boundaries)
- Least privilege on tools. The agent gets only the actions and data it actually needs. A
service_rolekey that bypasses row-level security is a ready-made leak scenario. (Boundaries) - Screen actions against the original intent. Every tool call is checked against what the human asked for, not against what the model "came up with" along the way. (Boundaries, Measurement)
- A human in the loop for irreversible actions. Sending to a customer, changing ERP data, a payment, or deleting data requires approval. (Escalation)
- An audit trail. Every decision and every tool call is logged, so after the fact you know what the agent did and why. (Trace)
- Guardrail models alongside deterministic controls, not instead of them. Filters like Llama Guard or NeMo Guardrails help, but on their own they are not a boundary.
At Syntalith, containment is a default part of the project, not an add-on. We use LangGraph when every step must be auditable, and OpenClaw when the task is open-ended: in isolated virtual machines, with scoped data, a log, and human consent before anything affects production. The difference is not in what we use. It is that we know when and how to contain it.
What it means when choosing a vendor
Prompt injection is also a practical test for agent-washing. Before you pay for an "AI agent", ask the vendor five things that map directly onto this risk:
- What can the agent do on its own, and what does it only prepare for approval?
- What data and tools does it see, and does it have the least privilege it needs?
- How do you separate untrusted content from instructions?
- What is logged, and after an incident can you reconstruct what happened?
- Where does the agent run, and how isolated are its tools?
If the vendor has no answers to these, they are most likely selling a chatbot with system access, not a contained agent. How to read these criteria step by step is explained in the guide what is an AI agent.
Prompt injection and the AI Act
From 2 August 2026, the duty under Article 50 of the AI Act requires systems that talk to people to disclose that they are AI. That is a separate transparency duty, not a defense against injection, but it goes hand in hand with the same approach: disclosure, a conversation trace, and the ability for a human to take over.
In one sentence
Prompt injection is an architectural problem, not a single bug to patch. The model treats instructions and data as one stream, so a safe agent is built through boundaries, least privilege, escalation, and a trace, not through a promise of a "resistant model". The safest assumption is this: treat the model as an untrusted interpreter, not as an autonomous decision-maker.
Start with the process, not the model
If you are planning an agent that will reach into your systems, start with a free process scan. In 30 minutes with an engineer we will establish what the agent should do, what it must not do, and where a human steps in. The same thinking that contains prompt injection also contains the cost and the risk of the implementation.
Book a free process scan | What is an AI agent | AI agent implementation service
Sources
- OWASP Top 10 for LLM Applications, 2025 (LLM01: Prompt Injection, LLM06: Excessive Agency)
- OWASP Top 10 for Agentic Applications, December 2025
- EchoLeak: CVE-2025-32711 (Aim Security)
- Slack AI data exfiltration (PromptArmor, 2024)
- Cursor + Supabase MCP (General Analysis, 2025)
- Amazon Q: AWS-2025-015
- Simon Willison: the dual-LLM pattern against prompt injection