Direct Prompt Injection in Production LLMs: A Pentester’s Walkthrough


The Risk Is Already in Your Production Environment

A financial services company deploys a customer-facing AI assistant to handle account inquiries. The system prompt instructs it to only discuss account balances and recent transactions. A security researcher, during a routine pre-launch review, submits a single message: a carefully worded instruction that tells the model to ignore its original role and instead summarize the contents of its context window.

The model complies. It reveals business logic embedded in the system prompt, the existence of internal tool integrations, and behavioral constraints the product team had assumed were invisible to end users.

This isn’t a hypothetical designed to alarm you; it’s a pattern we encounter regularly across client engagements at Pentest Testing Corp. We’ve conducted penetration tests for over 257 organizations globally, and the integration between LLMs and production systems is where the most consequential vulnerabilities tend to live. Direct prompt injection is consistently among the first things we find, and often the one with the broadest downstream impact.

This post walks through how we actually test for it, what the findings typically look like, and what genuine mitigation involves.


What Direct Prompt Injection Actually Is

Prompt injection is classified as OWASP LLM01:2025, the top-ranked risk in the OWASP Top 10 for Large Language Model Applications. The category covers two distinct attack vectors.

Direct prompt injection originates from the user input field. The attacker’s instructions arrive exactly where legitimate user messages arrive, but instead of asking a question or making a request within the model’s intended scope, the attacker crafts input designed to override or redirect the model’s behavior. The goal is to make the LLM act on the attacker’s instructions rather than the developer’s.

Indirect prompt injection is different. Malicious instructions are embedded in external content that the LLM retrieves and processes: a document, an email, a web page, a database record. The attack arrives through a trusted pipeline rather than directly from a user. Both fall under LLM01:2025, and both require testing, but they need different methodologies.

This post focuses on direct injection. What makes it interesting from a pentesting perspective isn’t that it’s exotic; it’s that it’s structurally inherent to how LLMs work. The model receives text, and it processes all of that text, system prompt and user message alike, as input. There’s no hardware-enforced separation between “trusted instructions” and “untrusted user input” the way there would be at a traditional system call boundary. That architectural reality is what makes the class persistent.
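To make the missing boundary concrete, here is a minimal sketch of how a chat transcript collapses into one text stream. The `Message` type, the `[role]` tag format, and `render_prompt` are illustrative assumptions; real chat templates use model-specific special tokens, but the structural point is the same.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str      # "system" or "user" (hypothetical labels)
    content: str


def render_prompt(messages: list[Message]) -> str:
    """Flatten a transcript into the single string a model would consume.

    The role tags are ordinary text in the same stream as the content;
    nothing outside the model enforces them as a trust boundary.
    """
    return "\n".join(f"[{m.role}] {m.content}" for m in messages)


system = Message("system", "Only discuss account balances and recent transactions.")
# Inert placeholder standing in for an override attempt; not a working payload.
user = Message("user", "<instruction-override attempt goes here>")

prompt = render_prompt([system, user])
print(prompt)
```

Both messages end up as adjacent lines of the same string, which is why the model has to infer, rather than be told by the platform, which line carries authority.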

OWASP LLM01 and LLM06: The Two Categories That Matter Most Here

When we scope an LLM penetration test, we map every finding to the OWASP LLM Top 10 (2025). For direct prompt injection engagements, two categories dominate the report.

LLM01:2025 – Prompt Injection

This is the primary classification for the vulnerability class itself. OWASP defines it as a vulnerability that occurs when user prompts alter the LLM’s behavior in unintended ways, potentially resulting in data leakage, privilege escalation, or execution of unauthorized actions.

What’s often underappreciated is that LLM01 isn’t just about getting a chatbot to say something it shouldn’t. In systems where the LLM can invoke tools, query databases, send emails, or trigger API calls, a successful prompt injection can chain into an action that has real-world consequences. The severity of the finding scales directly with what the model has permission to do.

LLM06:2025 – Excessive Agency

This is the amplifier. Excessive Agency refers to a condition where an LLM-based system holds more capability, autonomy, or permission than it actually needs to accomplish its intended function. When direct prompt injection succeeds in an LLM that has broad tool access, the impact category shifts from “information disclosure” to “unauthorized action in a live system.”

The two vulnerabilities combine predictably: LLM01 is the injection vector, and LLM06 determines how far the damage propagates. A chatbot that can only display text presents limited exposure even when injected. A copilot that can write to a database, call internal APIs, or access a file system is an entirely different risk surface.
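One way to picture the relationship is a lookup from the model's most powerful capability to the impact tier a successful injection could reach. This is a rough sketch; the capability names and tier labels are assumptions made for illustration, not an OWASP-defined scale.

```python
from enum import IntEnum


class Capability(IntEnum):
    # Hypothetical capability tiers, ordered by how much damage they allow.
    TEXT_ONLY = 0
    READ_DATA = 1
    WRITE_DATA = 2
    EXTERNAL_ACTION = 3  # e.g. sending email, calling APIs, executing code


IMPACT_LABEL = {
    Capability.TEXT_ONLY: "unintended output only",
    Capability.READ_DATA: "information disclosure",
    Capability.WRITE_DATA: "unauthorized data modification",
    Capability.EXTERNAL_ACTION: "unauthorized action in a live system",
}


def injection_impact(capabilities: set[Capability]) -> str:
    """Return the impact tier implied by the most powerful capability held."""
    worst = max(capabilities, default=Capability.TEXT_ONLY)
    return IMPACT_LABEL[worst]


print(injection_impact({Capability.READ_DATA}))
print(injection_impact({Capability.READ_DATA, Capability.EXTERNAL_ACTION}))
```

The injection technique is identical in both calls; only the permission set changes the outcome.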

Both categories are part of how we frame findings, and both are referenced by auditors reviewing AI-enabled systems for SOC 2, ISO 27001, and similar compliance frameworks.

Jailbreak Testing vs. Prompt Injection Testing: Not the Same Thing

These terms get conflated in a lot of vendor content, and the conflation matters because the testing objectives are different.

Jailbreak testing evaluates whether a model can be coaxed into producing content that violates its safety policies: generating harmful content, bypassing content filters, ignoring ethical guardrails. The target is the model’s behavior toward prohibited outputs.

Prompt injection testing evaluates whether an attacker can override the application’s business logic and security controls by manipulating the model’s input. The target is the application’s intended constraints and the integrity of the system prompt.

You can have a model that resists jailbreaks completely but remains vulnerable to prompt injection against its production context. You can also have the reverse. They test different things, and a methodology that only covers one is incomplete. Our testing methodology evaluates both, and the scope of each is defined during the pre-engagement scoping call to match the specific deployment architecture.

How We Approach Prompt Injection Testing: The Methodology

LLM penetration testing doesn’t follow the same workflow as web application testing. There’s no CVE database to cross-reference, no standardized scanner output to interpret. What we have instead is a structured process for probing how the model interprets and prioritizes competing instructions.

Phase 1: Reconnaissance and Context Mapping

Before writing a single test prompt, we need to understand what we’re testing. This means documenting the LLM’s intended role, understanding the system prompt’s scope (even when we can’t read it directly), identifying what tools or integrations the model has access to, and establishing what the expected behavior boundaries are.

This phase is often collaborative. If we’re working with a client’s internal team, we’ll want access to the system prompt, the tool definitions, and the deployment architecture. For black-box engagements, we reconstruct this through behavioral inference, asking the model questions that reveal how it’s been configured, what it knows about itself, and how it responds to edge cases.

Phase 2: Boundary Probing

With context established, we systematically probe the model’s response to inputs that push against its defined constraints. This includes:

Instruction override attempts: Prompts that directly instruct the model to disregard previous instructions or take on a new role. The goal isn’t to check whether the model says “I can’t do that”; it’s to understand whether that refusal is robust under variations in phrasing, framing, and context.

Role reassignment patterns: Inputs that attempt to redefine the model’s identity, authority level, or operational context. This tests whether the model’s adherence to its configured role degrades when challenged persistently or creatively.

Constraint elicitation: Prompts designed to surface the model’s underlying instructions, tool definitions, or business logic, typically by asking it to describe, summarize, or explain aspects of its own configuration.

Context window poisoning patterns: In conversational systems with memory or multi-turn context, testing whether injected content in earlier turns can influence behavior in later turns.
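The single-turn probe categories above could be organized in a small harness like the following sketch. Everything here is an assumption for illustration: the probe strings are inert placeholders rather than payloads, `stub_model` stands in for a real client, and `looks_like_refusal` is a crude keyword check that a real engagement would replace with manual review.

```python
from collections import Counter
from typing import Callable

# Placeholder probes, one per single-turn category. Real test inputs are
# written per engagement and are deliberately not shown here.
PROBES = {
    "instruction_override": "<override-style probe>",
    "role_reassignment": "<role-reassignment probe>",
    "constraint_elicitation": "<configuration-elicitation probe>",
}

REFUSAL_MARKERS = ("can't help", "cannot help", "outside my scope")


def looks_like_refusal(reply: str) -> bool:
    """Crude heuristic; anything it does not match is queued for a human."""
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def run_probes(model: Callable[[str], str], runs: int = 5) -> dict[str, Counter]:
    """Send each probe several times and tally the outcomes per category."""
    results: dict[str, Counter] = {}
    for category, probe in PROBES.items():
        tally: Counter = Counter()
        for _ in range(runs):
            reply = model(probe)
            tally["refused" if looks_like_refusal(reply) else "needs_review"] += 1
        results[category] = tally
    return results


def stub_model(prompt: str) -> str:
    # Hypothetical stand-in for the system under test.
    return "Sorry, that is outside my scope."


print(run_probes(stub_model, runs=3))
```

Context window poisoning is not covered by this sketch because it needs a stateful, multi-turn client; the same tallying approach would apply once the `model` callable carries conversation history.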

Phase 3: Agent Tool Abuse Evaluation

When the LLM has access to tools, particularly any tool that writes data, makes network requests, or interfaces with external systems, we evaluate whether a successful injection can be chained into an unauthorized action. This is where LLM06 findings emerge.

The evaluation isn’t about demonstrating a full exploit chain. It’s about establishing whether the control boundaries between user intent and tool execution are enforced outside the model’s reasoning, or whether they depend entirely on the model reasoning correctly about what it should and shouldn’t do.
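As an illustration of a boundary enforced outside the model's reasoning, the sketch below checks every tool call the model proposes against an allowlist keyed on the authenticated user's role. The tool names, roles, and the `authorize` function are hypothetical; the point is only that the check runs in orchestration code the prompt cannot influence.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


class ToolDenied(Exception):
    """Raised when a proposed tool call is outside the caller's allowlist."""


# Hypothetical allowlist keyed by the authenticated user's role, which comes
# from the session and never from model output.
ALLOWED_TOOLS = {
    "employee": {"search_hr_policies"},
    "hr_admin": {"search_hr_policies", "update_policy"},
}


def authorize(call: ToolCall, user_role: str) -> ToolCall:
    """Return the call unchanged if permitted, otherwise raise ToolDenied."""
    allowed = ALLOWED_TOOLS.get(user_role, set())
    if call.name not in allowed:
        raise ToolDenied(f"{call.name!r} is not permitted for role {user_role!r}")
    return call


# The model may propose anything; the gate decides what actually runs.
authorize(ToolCall("search_hr_policies", {"query": "parental leave"}), "employee")
try:
    authorize(ToolCall("update_policy", {"id": 7}), "employee")
except ToolDenied as exc:
    print(exc)
```

If a deployment has no equivalent of this gate, the only thing standing between an injected instruction and a tool invocation is the model's own judgment.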

Phase 4: Reporting and OWASP Mapping

Every finding is documented with a clear description of the test input, the observed behavior, the deviation from intended behavior, and the mapped OWASP LLM category. We rate severity based on the combination of exploitability and potential impact given the model’s actual capabilities and access scope. Recommendations are specific to the architecture, not generic.
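A finding record along these lines might look like the sketch below. The field names, the 1 to 3 scoring scale, and the severity thresholds are assumptions chosen for illustration rather than a published rating scheme; the category labels are the OWASP LLM Top 10 (2025) entries referenced in this post.

```python
from dataclasses import dataclass

OWASP_LLM_2025 = {
    "LLM01:2025": "Prompt Injection",
    "LLM06:2025": "Excessive Agency",
    "LLM07:2025": "System Prompt Leakage",
}


@dataclass(frozen=True)
class Finding:
    test_input_summary: str   # description of the input, not a payload
    observed_behavior: str
    intended_behavior: str
    owasp_category: str
    exploitability: int       # hypothetical scale: 1 (hard) to 3 (easy)
    impact: int               # hypothetical scale: 1 (low) to 3 (high)

    def __post_init__(self) -> None:
        if self.owasp_category not in OWASP_LLM_2025:
            raise ValueError(f"unknown category: {self.owasp_category}")
        for score in (self.exploitability, self.impact):
            if score not in (1, 2, 3):
                raise ValueError("scores must be 1, 2, or 3")

    @property
    def severity(self) -> str:
        """Combine exploitability and impact into a coarse rating."""
        product = self.exploitability * self.impact
        if product >= 6:
            return "high"
        if product >= 3:
            return "moderate"
        return "low"


finding = Finding(
    test_input_summary="Role reframing embedded in an HR follow-up question",
    observed_behavior="Model listed document categories in its retrieval store",
    intended_behavior="Decline requests about internal configuration",
    owasp_category="LLM01:2025",
    exploitability=2,
    impact=2,
)
print(finding.severity, OWASP_LLM_2025[finding.owasp_category])
```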

A Sanitized Attack Scenario

The following is a composite, illustrative scenario based on vulnerability patterns we encounter in real engagements. No client data, real system prompts, or working exploit payloads are included.

Deployment context: A SaaS company deploys an internal HR copilot that employees can ask questions about company policies, benefits, and onboarding procedures. The copilot is backed by a RAG system pulling from a private document store. The system prompt instructs the model to only answer HR-related questions and to decline requests outside that scope.

What a pentester observed: During boundary probing, the tester submitted inputs that reframed the model’s role, not with aggressive or obviously adversarial language, but by embedding the reframing within what appeared to be a continuation of a legitimate HR question. The injected content asked the model to describe what other types of documents it had access to.

Observed behavior: The model partially revealed the document categories indexed in its retrieval system, including category names that suggested the store contained more than HR documentation. It also responded to subsequent follow-up questions outside its intended scope without invoking the refusal behavior that had appeared on earlier direct attempts.

Why it happened: The model’s refusal behavior was conditioned on recognizing out-of-scope requests in isolation. When the request was embedded in a plausible HR context and framed as a follow-up, the refusal trigger didn’t activate.

What the finding mapped to: LLM01:2025 (direct injection via context-embedded instruction override) with a secondary note on LLM07:2025 (system prompt/configuration leakage through behavioral inference). Severity was rated moderate because the document store access in this specific deployment was read-only and the data exposed wasn’t highly sensitive, but the same pattern in a deployment with write access or PII-heavy retrieval would rate significantly higher.

What remediation addressed: The fix involved output filtering for responses that describe internal configuration, stricter scoping of the RAG retrieval corpus, and restructuring the system prompt to explicitly handle follow-up questions as independent scope evaluations rather than inheriting context from earlier in the conversation.

Traditional Pentest vs. LLM Pentest: What Changes

If you’ve had web application or API penetration tests done before, the LLM engagement will feel different in ways that matter for scoping, budgeting, and interpreting results. Here’s where the key differences land.

Dimension	Traditional Web/API Pentest	LLM Penetration Test
Vulnerability taxonomy	CVE database, CWE, OWASP Web Top 10	OWASP LLM Top 10 (2025), custom AI threat models
Automated tooling	Scanners (Burp Suite, Nikto, Nessus) do significant heavy lifting	Manual testing dominates; no equivalent scanner for reasoning-layer flaws
Attack surface definition	Endpoints, parameters, headers, session tokens	System prompt, context window, tool definitions, retrieval corpus, agent permissions
Finding reproducibility	Deterministic — same input produces same output	Probabilistic — LLMs are non-deterministic; findings require multiple confirmation runs
Severity calculation	CVSS score based on standard impact/exploitability factors	Severity depends heavily on what tools and data the model can access
Fix validation	Re-run the payload, verify the response changed	Requires adversarial re-testing across input variations; patch may not hold under rephrasing
Compliance mapping	Maps to PCI DSS, HIPAA, SOC 2 controls for web/network	Maps to OWASP LLM01–LLM10; increasingly referenced in SOC 2 and ISO 27001 AI addenda
Time to complete	5–15 days depending on scope	5–15 days; agent-heavy systems with complex tool chains take longer
Primary tester skill	Network protocols, web stack, exploitation frameworks	LLM architecture, prompt design, adversarial ML, behavioral analysis

The most consequential difference for clients to understand is the probabilistic nature of LLM behavior. A finding in a web application test is binary: the vulnerability is there or it isn’t. In LLM testing, a model might refuse an injection payload 90% of the time and comply 10% of the time. Both behaviors need to be reported, and the fix needs to address consistency, not just worst-case outputs.
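Because a compliance rate measured over a handful of runs is only an estimate, it helps to report it with an interval. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; the run counts are made-up illustration values, not engagement data.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Illustration: the model complied with the injection in 2 of 20 runs.
low, high = wilson_interval(successes=2, trials=20)
print(f"observed rate 10%, plausible range {low:.1%} to {high:.1%}")
```

With only 20 runs the plausible range is wide, which is the practical argument for repeating each confirmed finding many times before and after a fix.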

What Good Defenses Actually Look Like

Prompt injection isn’t a bug you patch with a line of code. It’s a risk you architect against with multiple independent controls. Here’s what effective defense-in-depth looks like based on the OWASP LLM Top 10 guidance and the NIST AI Risk Management Framework (NIST AI RMF 1.0), together with its companion 2024 Generative AI Profile (NIST AI 600-1), which addresses risk management practices specific to generative AI.

Least-Privilege Tool Access (Addresses LLM06)

The most reliable mitigation for injection-driven agent abuse is limiting what the model can do in the first place. An LLM that can only read from a single, scoped document corpus has a fundamentally smaller blast radius than one with write access to a CRM, email integration, and code execution capability. Scope tool permissions to the minimum required for the model’s actual use case. Review that scope when the use case changes.

Privilege Separation Between System and User Context

Architectural controls that treat system prompt content and user input as distinct trust domains, and enforce that distinction outside the model’s reasoning, reduce the attack surface meaningfully. This isn’t always straightforward in LLM deployments, but frameworks and middleware approaches exist that implement this separation at the orchestration layer.

Output Validation and Filtering

Responses that contain internal configuration language, tool schema references, or unusual structural patterns warrant interception before delivery. Output filtering is imperfect because it can be bypassed through rephrasing, but as one layer in a defense-in-depth architecture it reduces the exploitability of successful injections.
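A minimal sketch of such a filter follows. It assumes a canary string has been planted in the system prompt so that verbatim leakage is easy to spot; the canary value, the regex patterns, and the fallback message are all hypothetical and would need tuning per deployment.

```python
import re

# Hypothetical marker embedded in the system prompt; it should never appear
# in a legitimate reply.
CANARY = "cfg-canary-7f3a"

# Illustrative patterns for configuration language and tool-schema fragments.
LEAK_PATTERNS = [
    re.compile(r"system prompt", re.IGNORECASE),
    re.compile(r"my instructions (are|say)", re.IGNORECASE),
    re.compile(r'"parameters"\s*:\s*\{'),
]

FALLBACK = "Sorry, I can't share that."


def filter_output(reply: str) -> tuple[bool, str]:
    """Return (delivered, text); suspicious replies are replaced by FALLBACK."""
    if CANARY in reply or any(p.search(reply) for p in LEAK_PATTERNS):
        return False, FALLBACK
    return True, reply


print(filter_output("Your balance is $120.00."))
print(filter_output("My system prompt says to only discuss balances."))
```

A paraphrased leak would pass straight through these patterns, which is why this layer complements permission limits rather than replacing them.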

Human Approval for High-Impact Actions

For any agent action that writes data, sends communications, makes external API calls, or executes code, requiring explicit human confirmation before execution is the most reliable control against injection-driven abuse. It accepts that the model can be manipulated and compensates at the action layer rather than relying on the model to reason correctly under adversarial conditions.
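The confirmation gate can be sketched as a thin wrapper around action execution. The action names and the `approve` and `run` callables are assumptions for illustration; in practice `approve` would block on a ticket, chat prompt, or UI confirmation handled by a person.

```python
from typing import Callable

# Hypothetical set of actions considered high-impact in this deployment.
HIGH_IMPACT = {"write_record", "send_email", "call_external_api", "run_code"}


def execute_with_approval(
    action: str,
    args: dict,
    run: Callable[[str, dict], str],
    approve: Callable[[str, dict], bool],
) -> str:
    """Run low-impact actions directly; require a human decision for the rest."""
    if action in HIGH_IMPACT and not approve(action, args):
        return f"blocked: {action} was not approved"
    return run(action, args)


def fake_run(action: str, args: dict) -> str:
    # Stand-in for real tool execution.
    return f"executed {action}"


# A reviewer who rejects everything, to show the blocked path.
deny_all = lambda action, args: False

print(execute_with_approval("search_docs", {"q": "leave policy"}, fake_run, deny_all))
print(execute_with_approval("send_email", {"to": "x@example.com"}, fake_run, deny_all))
```

The design choice is that the gate keys on the action, not on how the request was phrased, so it holds even when the model has been fully manipulated.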

Adversarial Testing as an Ongoing Practice

Security testing for LLMs isn’t a one-time event. Model behavior can shift across versions; system prompt changes introduce new boundary conditions; new tool integrations expand the attack surface. Periodic adversarial testing, not just at launch, is what the OWASP LLM Top 10 and NIST AI RMF both recommend, and it’s the practice that separates organizations that find injection vulnerabilities from those that have them found by someone else.

Frequently Asked Questions About Direct Prompt Injection

What is direct prompt injection in LLMs?

Direct prompt injection is when a user crafts input that overrides or manipulates the LLM’s original system instructions. Instead of asking a question, the attacker reframes what the model is supposed to do, essentially hijacking its behavior mid-conversation. Unlike indirect injection (which arrives through retrieved documents or tool outputs), direct injection comes straight from the user prompt field.

What is the difference between direct and indirect prompt injection?

Direct prompt injection originates from the user input field and attempts to manipulate the model’s behavior in real time. Indirect prompt injection is embedded in external data sources (a retrieved document, an email, a web page) that the LLM processes as part of its context. Both are classified under OWASP LLM01:2025, but indirect injection is typically harder to detect because the malicious content enters through a trusted pipeline rather than a user-controlled field.

How do pentesters test for prompt injection in production systems?

Prompt injection testing follows a structured methodology: reconnaissance to understand the model’s role and access scope, followed by systematic boundary probing using instruction override attempts, role reassignment patterns, and constraint elicitation techniques. Pentesters document both successful manipulations and partial bypasses, assess whether injected instructions could reach downstream tools or agents, and map findings to OWASP LLM01:2025 and LLM06:2025. Output filtering, agent tool access, and guardrail consistency are tested across multiple input variations, because with LLMs, consistency of defense matters as much as the defense itself.

Does a content filter or system prompt protect against prompt injection?

Partially, but neither is sufficient on its own. System prompts define intended behavior, but LLMs don’t enforce a hard boundary between instructions and user input at the architectural level, so sufficiently crafted user messages can override them. Content filters catch known patterns but are bypassable through rephrasing, encoding, or context manipulation. Defense-in-depth is the correct approach: least-privilege tool access, output validation, human-in-the-loop controls for high-risk actions, and adversarial testing to identify gaps before attackers do.

What is OWASP LLM01 and why does it matter?

OWASP LLM01:2025 is the top-ranked risk in the OWASP Top 10 for Large Language Model Applications (2025 edition). It covers prompt injection vulnerabilities, both direct and indirect, and is ranked first because it underpins many other LLM attack categories. An LLM01 vulnerability in an agentic system can enable privilege escalation, data exfiltration, or unauthorized actions in connected systems. Auditors, compliance teams, and security assessors increasingly reference LLM01 when evaluating AI-enabled products against frameworks like SOC 2 and ISO 27001.

How much does an LLM penetration test cost?

AI and LLM penetration testing at Pentest Testing Corp starts from $9,500 for a fixed-price engagement. Final pricing depends on the type of AI system, integration depth, the number of exposed LLM APIs and agent tools, and whether adversarial red team testing is included. There’s no hourly billing, no scope-creep surprises. Book a 30-minute scoping call to get a precise quote for your environment.

Conclusion and Next Steps

Prompt injection is the most prevalent vulnerability class we encounter across AI penetration testing engagements, and it’s the most structurally persistent, because it’s rooted in how language models process input, not in a specific implementation flaw that can be patched and forgotten.

What effective security looks like here isn’t blocking every possible injection payload. That’s not a realistic target. It’s architecting your deployment so that the impact of a successful injection is bounded: limited tool access, action confirmation gates, output filtering, and regular adversarial testing to catch gaps before they matter.

If your organization is running a chatbot, copilot, RAG application, or AI agent in production, or planning to, your AI penetration testing engagement should include explicit prompt injection test cases, agent tool abuse evaluation, and a clear map of findings to the OWASP LLM Top 10 (2025). That’s what we deliver, on a fixed-price basis, for organizations across financial services, e-commerce, and technology.

Ready to understand your actual exposure? Book a 15-minute scoping call with our Team Lead Shofiur Rahman. We’ll scope the engagement, answer your technical questions, and give you a fixed-price quote – no sales process, no obligation.
