How we defend MCP tool outputs from prompt injection

When we built Arcjet’s MCP server, the obvious security boundaries were authentication, authorization, input validation, rate limits, and confirmation prompts for mutating tools. The Go MCP SDK provides excellent scaffolding and helps structure these requirements, but it doesn't do anything special about the tool response boundary.

In a normal API, JSON is data. In an agent workflow, JSON is context. A field called summary, suggestedActions, or reason may be read by a model and used to decide what to do next. If that field contains attacker-controlled text, the tool has become a prompt injection delivery mechanism.

Arcjet helps developers secure AI applications from abuse, which means our own systems process both arbitrary HTTP request data and requests from agentic clients. These include attacker controlled data such as the request path and headers, so indirect prompt injection through Arcjet's own MCP is a real concern.

Tool output is model input

Google recently published research on prompt injections on the web. They found prompt injection attempts appearing in public web content, ranging from pranks and SEO manipulation to agent deterrence, exfiltration attempts, and destructive instructions.

This is relevant as newer applications do not only read user prompts - web pages, emails, docs, logs, support tickets, issue comments, request metadata, and tool results are all potentially part of the input or output.

Our MCP tools allow Arcjet users to query request data in aggregate and for specific requests, so they include security-relevant data such as request paths, hosts, IPs, headers, and error details. Those fields are exactly where hostile input can show up.

This is easy to miss when building the tool output. For example, you could create a simple string response with:

summary: `Request to ${path} was denied because ${headerName} contained disallowed value: ${headerValue}`;

This would be bad because path, headerName and headerValue are all attacker controlled. Even though we sanitize the values before storing them, protecting against prompt injection isn't simply a case of applying the right encoding.

What the MCP spec says

The MCP tools spec already points in the right direction.

Tools are model-controlled, meaning a model can discover and invoke them automatically. The spec recommends human confirmation for sensitive operations, supports structured output through structuredContent, and lets tools define an outputSchema.

The security considerations are especially relevant: servers must validate inputs, implement access controls, rate limit tool invocations, and sanitize tool outputs. Clients should validate tool results before passing them to the LLM.

"Sanitize" is doing a lot of work there! We have to distinguish between Arcjet-provided guidance and attacker-controlled evidence.

Our answer was to make the trust boundary visible in the response shape so clients and models have an explicit signal about which fields are trusted guidance and which are untrusted evidence.

The pattern we use

Our rule is: trusted guidance must never contain untrusted text.

Trusted fields are generated only from server-controlled values: enums, counters, thresholds, static templates, and policy decisions. Raw evidence goes into explicitly untrusted fields. This looks something like:

type ExplainDecisionOutput = {
  summary: string;
  conclusion: "ALLOW" | "DENY" | "ERROR";
  reason: string;
  suggestedActions: string[];
  untrustedData: {
    path?: string;
    host?: string;
    reasonDetails?: string;
  };
};

So instead of:

summary: `Request to ${path} was denied because ${headerName} contained disallowed value: ${headerValue}`;

we use:

summary: "Request was denied by prompt injection detection.",
untrustedData: { path, headerName, headerValue }

The trusted summary is less descriptive, but it is safer. The raw evidence is still available for display, investigation, and debugging, but it is not presented as server-authored guidance. This gives the model enough context to reason about the decision while keeping the raw evidence in a separate, clearly labeled place. More sophisticated outputs explain the meaning of the values and what to do with them, but the actual values are always specifically separate.

Schema text is part of the defense

MCP’s outputSchema is not just developer experience - it's part of the security surface. The schema helps clients and models understand the structure of the result before they use it.

We use schema descriptions to label trust explicitly.

Fields like summary say they are derived from server-controlled enums and counters. Fields like untrustedData.path say they are attacker-controlled and display-only.

That does not make the model magically safe, but it makes the boundary explicit. The client and the model have fewer reasons to confuse raw evidence with instructions.

Testing the boundary

We also added regression tests for the main failure mode: attacker-controlled text crossing into trusted fields.

The tests inject hostile strings into realistic places: request paths, hosts, reason details, rule labels, rule IDs, bot categories, filter expressions, metadata values, and top paths in analytics output.

Then we assert those strings never appear in trusted fields such as summary, suggestedActions, recommendations, reason, or risk. If the value needs to be returned, it can appear only in untrustedData.

Runtime enforcement with Guards

The MCP output pattern protects what our tools say back to agents, but agents also fetch arbitrary web pages, process queue messages, summarize support tickets, and call tools that return untrusted text.

That is where Arcjet Guards fits.

Guards run Arcjet security rules inside tool handlers, queue workers, background jobs, and other non-HTTP code paths. There is no Request object. You pass the input directly and get a decision back.

In TypeScript:

import { launchArcjet, tokenBucket, detectPromptInjection } from "@arcjet/guard";

const arcjet = launchArcjet({ key: process.env.ARCJET_KEY! });

const userLimit = tokenBucket({
  label: "user.tool_call_bucket",
  bucket: "tool-calls",
  refillRate: 100,
  intervalSeconds: 60,
  maxTokens: 500,
});

const piRule = detectPromptInjection();

async function searchWeb(query: string, userId: string) {
  const decision = await arcjet.guard({
    label: "tools.search_web",
    metadata: { userId },
    rules: [userLimit({ key: userId, requested: 1 }), piRule(query)],
  });

  if (decision.conclusion === "DENY") {
    return { content: "[Blocked: unsafe tool input]" };
  }

  return doSearch(query);
}

Or in Python:

import os
from arcjet.guard import launch_arcjet, TokenBucket, DetectPromptInjection

arcjet = launch_arcjet(key=os.environ["ARCJET_KEY"])

user_limit = TokenBucket(
    label="user.task_bucket",
    bucket="task-calls",
    refill_rate=100,
    interval_seconds=60,
    max_tokens=500,
)

pi_rule = DetectPromptInjection()

async def process_task(user_id: str, message: str):
    decision = await arcjet.guard(
        label="tasks.generate",
        metadata={"user_id": user_id},
        rules=[user_limit(key=user_id, requested=1), pi_rule(message)],
    )

    if decision.conclusion == "DENY":
        raise RuntimeError(f"Blocked: {decision.reason}")

    return await run_task(message)

The important detail is placement. Call guard() inline where the operation happens: inside the tool handler, task processor, queue worker, or function where untrusted input enters the system. Configure the client and rules once at module scope. Use stable labels, buckets, and keys so decisions are observable and rate limits do not collide.

Prompt injection defense is not only about better model instructions - it's important to also make trust boundaries visible in code.

Checklist for agent tool builders

If you are building MCP tools or agent workflows:

Separate trusted guidance from untrusted evidence in every tool response.
Generate trusted fields only from enums, counters, static templates, and policy decisions.
Put raw web, request, metadata, error, and tool text under clearly labeled untrustedData.
Define an outputSchema and use schema descriptions to mark trust boundaries.
Sanitize and, where possible, validate tool results before they reach the LLM.
Scan fetched or user-supplied content before returning it to the model.
Add rate limits, timeouts, redirect limits, content-type checks, and max response sizes.
Require confirmation for mutating or externally visible actions.
Add adversarial tests for every trusted output field the model might read.