Prompt Injection Blocking: How Pre-Execution Security Stops the Attack

← Back to Blog

Most people who worry about prompt injection are thinking about the wrong problem. They picture a chatbot returning offensive text, or a model being tricked into saying something it shouldn't. That's a real issue — but it's not the one that gets you compromised.

In an agentic system, prompt injection doesn't produce bad text. It produces malicious tool calls. And tool calls do things in the world: they read files, write to databases, make HTTP requests, execute shell commands. The output isn't a sentence you can discard. It's an action that already happened.

That distinction is why prompt injection blocking in agentic contexts requires a completely different approach than traditional input sanitization. This post breaks down how the attack actually works, where existing defenses fall short, and why pre-execution blocking at the tool call layer is the only backstop that reliably works.

How Prompt Injection Actually Works in Agentic Contexts

The classic prompt injection framing is: attacker smuggles instructions into user-controlled input, model follows those instructions instead of the developer's. That's accurate as far as it goes. What it misses is the consequence in an agentic context.

Modern AI agents are built around tool use. The model doesn't just respond — it decides what actions to take and calls functions to take them. When an agent reads a document, summarizes a web page, processes a file, or queries an external API, it's ingesting content from the environment. That content becomes part of the model's context. And anything in the model's context can influence its decisions.

This is what makes agentic prompt injection prevention so difficult. In a standard chatbot, you can sanitize user input before it reaches the model. In an agent, the model is designed to consume arbitrary external content — that's its job. You can't sanitize the web. You can't sanitize every file your user might ask the agent to read. The attack surface is, by design, open.

Injection vectors in agentic systems include: web pages the agent browses, documents it processes, API responses it parses, code repositories it reviews, emails it reads, and — relevant to MCP-based architectures — tool descriptions that ship with malicious instructions embedded in their metadata. Any text that enters the model's context window is a potential injection surface.

The Attack Chain: From Injected Text to Executed Tool Call

Understanding the full attack chain matters because it reveals exactly where defenses can and can't intervene. Here's how a typical LLM injection attack unfolds in an agentic system:

Step 1 — Delivery. The agent ingests content containing hidden or disguised instructions. This might be visible text formatted to look like a system message, invisible characters injected into an otherwise normal document, or off-screen HTML that the agent's vision capabilities still process.

Step 2 — Interpretation. The model processes the injected content alongside its legitimate context. Because the model has no reliable way to distinguish between instructions from its operator and instructions embedded in content, it treats both as potentially actionable. A well-crafted injection exploits this by framing the malicious instruction as a high-priority system directive.

Step 3 — Tool call formation. The model decides to act. It selects a tool — maybe bash, read_file, http_request, or send_email — and constructs the arguments. At this point the attack has already succeeded at the cognitive level. The question is whether anything stops the call before it fires.

Step 4 — Execution. Without a blocking layer, the tool call fires. The file gets read. The request goes out. The command runs. The window to prevent harm has closed.

This chain is why prompt injection defense aimed at Step 1 or Step 2 — input sanitization, instruction hierarchy, constitutional prompting — is necessary but not sufficient. You need a backstop at Step 3, before Step 4 happens.

Real-World Examples: Perplexity Comet and Invisible Screenshot Injection

These attacks aren't theoretical. Two documented cases illustrate the attack chain in practice.

The first involves Perplexity's Comet browser agent. Brave Security demonstrated in August 2025 how injected commands embedded in web content could instruct the agent to navigate to banking sites, extract saved passwords, and exfiltrate sensitive data to an attacker-controlled server. Separately, LayerX Research disclosed "CometJacking" — showing how a single malicious click could hijack Comet's session and expose user data. Perplexity has since published their own response acknowledging the attack class and describing mitigations. These incidents are cited here as documented real-world examples of the attack pattern — Shoofly had no involvement.

The second is what researchers call invisible screenshot injection — a documented attack pattern targeting vision-capable agents. When an agent uses a browser or screen-capture capability, it renders visual content and passes it to a multimodal model. Attackers can embed instructions in that content using techniques invisible to human users: CSS-hidden text, off-screen elements, or content rendered at zero opacity. The vision model reads it. The agent acts on it. The user never sees anything unusual.

Both attacks share the same structure: trusted channel, injected payload, tool call as the outcome. Neither would be stopped by input sanitization alone. The agent is doing exactly what it's supposed to do — reading web content, processing visual information — and the injection rides along for free. This is also directly relevant to coding agents that read repositories, review PRs, or process external documentation.

For a related class of attack delivered through tool metadata rather than content, see our post on MCP tool poisoning. The delivery mechanism differs; the exploit chain is the same.

Why Input-Layer Defenses Aren't Enough

The conventional response to prompt injection is to try to fix it at the input layer: better system prompts, stricter instruction hierarchies, content filtering before the model sees it, or training the model to be more skeptical of instructions in user-controlled content.

These approaches have genuine value. A well-structured system prompt that clearly delineates operator instructions from user content does reduce injection success rates. Models are getting better at recognizing and ignoring suspicious embedded instructions. This progress is real.

But none of it is deterministic. Models don't parse text with the precision of a rule engine. They have no internal firewall between "this is a system instruction" and "this is content I'm summarizing." Sufficiently creative injections — especially those that mimic the style of legitimate system messages, or that exploit attention patterns the model learned in training — will get through. The question isn't whether input-layer defenses reduce the attack surface. They do. The question is whether they're a complete defense. They're not.

There's also the tool description problem. MCP servers ship tool descriptions that the model reads at initialization. If those descriptions contain injected instructions — telling the model to always include certain data in API calls, or to prefer certain actions under specific conditions — no amount of input filtering will catch it, because the injection is part of the tool definition itself. You'd have to audit every tool description before loading it. Which is, incidentally, something a policy-based security layer can do.

The deeper problem is that input-layer defenses operate at the wrong point in the attack chain. They try to prevent the model from being influenced. A backstop at the tool call layer doesn't care whether the model was influenced — it asks whether the resulting action is permitted. That's a much simpler question to answer reliably.

Pre-Execution Blocking as the Backstop

Pre-execution blocking works at Step 3 of the attack chain: after the model has decided what tool to call, but before the runtime fires the call. At that moment, the full intent is visible — tool name, arguments, parameters — and a policy engine can evaluate it synchronously. If the call violates policy, it never executes. The violation is logged, an alert fires, and the agent gets back an error rather than a result.

This is architecturally different from detection. Detection observes what happened. Blocking determines what happens. For the categories of harm that matter most in agentic systems — data exfiltration, unauthorized file access, unexpected network egress, destructive shell commands — the only useful intervention is one that happens before the damage is done. That's what blocking buys you that detection doesn't.

In Shoofly's implementation, the block happens at the hook layer. Every tool call passes through the policy engine synchronously before dispatch. The policy rules are open and auditable — you can read exactly what's being evaluated, modify the rules for your threat model, and understand every decision the system makes. There's no black box between your agent and your policy.

Shoofly Basic is free and covers detection and alerting, with an open, auditable threat policy you can inspect and extend. Shoofly Advanced upgrades to full pre-execution blocking: calls that violate policy never fire, with real-time alerts via Telegram and desktop notifications, plus policy linting to catch configuration errors before they become gaps. See plans and pricing for details.

Pre-execution blocking doesn't replace input-layer defenses. It complements them. Run the best system prompt you can. Use a model that's been trained to resist injection. Apply content filtering where it makes sense. Then add a blocking layer at the tool call boundary as the final backstop — because that's the one point in the chain where policy can be enforced deterministically, regardless of what the model decided.

Prompt injection in agentic systems is a hard problem precisely because the attack surface is the agent's job. You can't close it without disabling the agent. What you can do is ensure that even when an injection succeeds at the cognitive level, it doesn't succeed at the execution level. That's the goal of agentic prompt injection prevention built around pre-execution blocking. And it's the only layer of defense where the answer to "was this stopped?" is reliably yes or no.

For implementation guidance, see the Shoofly guides.


See plans and pricing →

Add runtime security for Claude Code and OpenClaw agents:

curl -fsSL https://shoofly.dev/install.sh | bash

Related reading: MCP Tool Poisoning · Why We Block Instead of Detect · Agentic AI Security · Claude Code Security · MCP Security