Direct Prompt Injection in Production LLMs: A Pentester’s Walkthrough
The Risk Is Already in Your Production Environment
A financial services company deploys a customer-facing AI assistant to handle account inquiries. The system prompt instructs it to only discuss account balances and recent transactions. A security researcher, during a routine pre-launch review, submits a single message: a carefully worded instruction that tells the model to ignore its original role and instead summarize the contents of its context window.
The model complies. It reveals business logic embedded in the system prompt, the existence of internal tool integrations, and behavioral constraints the product team had assumed were invisible to end users.
This isn’t a hypothetical designed to alarm you; it’s a pattern we encounter regularly across client engagements at Pentest Testing Corp. We’ve conducted penetration tests for over 257 organizations globally, and the integration between LLMs and production systems is where the most consequential vulnerabilities tend to live. Direct prompt injection is consistently among the first things we find, and often the one with the broadest downstream impact.
This post walks through how we actually test for it, what the findings typically look like, and what genuine mitigation involves.

What Direct Prompt Injection Actually Is
Prompt injection is classified as OWASP LLM01:2025, the top-ranked risk in the OWASP Top 10 for Large Language Model Applications. The category covers two distinct attack vectors.
Direct prompt injection originates from the user input field. The attacker’s instructions arrive exactly where legitimate user messages arrive, but instead of asking a question or making a request within the model’s intended scope, the attacker crafts input designed to override or redirect the model’s behavior. The goal is to make the LLM act on the attacker’s instructions rather than the developer’s.
Indirect prompt injection is different: malicious instructions are embedded in external content that the LLM retrieves and processes, such as a document, an email, a web page, or a database record. The attack arrives through a trusted pipeline rather than directly from a user. Both fall under LLM01:2025, and both require testing, but they need different methodologies.
This post focuses on direct injection. What makes it interesting from a pentesting perspective isn’t that it’s exotic; it’s that it’s structurally inherent to how LLMs work. The model receives text and processes all of it as input. There’s no hardware-enforced separation between “trusted instructions” and “untrusted user input” the way there would be in a traditional system call boundary. That architectural reality is what makes the class persistent.
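The missing trust boundary is easier to see in code. The following is a minimal sketch of how role-tagged messages end up as one text sequence; `Message`, `render_prompt`, and the angle-bracket role markers are hypothetical names chosen for illustration, and real chat templates differ from model to model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str      # "system" or "user"
    content: str


def render_prompt(messages: list[Message]) -> str:
    """Flatten role-tagged messages into the single text sequence a model consumes.

    The markers are illustrative only; they are ordinary text, not an enforced boundary.
    """
    return "\n".join(f"<{m.role}>\n{m.content}\n</{m.role}>" for m in messages)


system = Message("system", "Only discuss account balances and recent transactions.")
user = Message("user", "[attacker-controlled text arrives here, in the same stream]")

prompt = render_prompt([system, user])
print(prompt)
```

Both the developer’s instructions and the user’s text land in the same string. The role markers tell the model which part is which, but nothing outside the model enforces that distinction.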
OWASP LLM01 and LLM06: The Two Categories That Matter Most Here
When we scope an LLM penetration test, we map every finding to the OWASP LLM Top 10 (2025). For direct prompt injection engagements, two categories dominate the report.
LLM01:2025 – Prompt Injection
This is the primary classification for the vulnerability class itself. OWASP defines it as a vulnerability that occurs when user prompts alter the LLM’s behavior in unintended ways, potentially resulting in data leakage, privilege escalation, or execution of unauthorized actions.
What’s often underappreciated is that LLM01 isn’t just about getting a chatbot to say something it shouldn’t. In systems where the LLM can invoke tools, query databases, send emails, or trigger API calls, a successful prompt injection can chain into an action that has real-world consequences. The severity of the finding scales directly with what the model has permission to do.
LLM06:2025 – Excessive Agency
This is the amplifier. Excessive Agency refers to a condition where an LLM-based system holds more capability, autonomy, or permission than it actually needs to accomplish its intended function. When direct prompt injection succeeds in an LLM that has broad tool access, the impact category shifts from “information disclosure” to “unauthorized action in a live system.”
The two vulnerabilities combine predictably: LLM01 is the injection vector, and LLM06 determines how far the damage propagates. A chatbot that can only display text presents limited exposure even when injected. A copilot that can write to a database, call internal APIs, or access a file system is an entirely different risk surface.
Both categories are part of how we frame findings, and both are referenced by auditors reviewing AI-enabled systems for SOC 2, ISO 27001, and similar compliance frameworks.
Jailbreak Testing vs. Prompt Injection Testing: Not the Same Thing
These terms get conflated in a lot of vendor content, and the conflation matters because the testing objectives are different.
Jailbreak testing evaluates whether a model can be coaxed into producing content that violates its safety policies: generating harmful content, bypassing content filters, ignoring ethical guardrails. The target is the model’s behavior toward prohibited outputs.
Prompt injection testing evaluates whether an attacker can override the application’s business logic and security controls by manipulating the model’s input. The target is the application’s intended constraints and the integrity of the system prompt.
You can have a model that resists jailbreaks completely but remains vulnerable to prompt injection against its production context. You can also have the reverse. They test different things, and a methodology that only covers one is incomplete. Our jailbreak testing methodology evaluates both, and the scope of each is defined during the pre-engagement scoping call to match the specific deployment architecture.
How We Approach Prompt Injection Testing: The Methodology
LLM penetration testing doesn’t follow the same workflow as web application testing. There’s no CVE database to cross-reference, no standardized scanner output to interpret. What we have instead is a structured process for probing how the model interprets and prioritizes competing instructions.
Phase 1: Reconnaissance and Context Mapping
Before writing a single test prompt, we need to understand what we’re testing. This means documenting the LLM’s intended role, understanding the system prompt’s scope (even when we can’t read it directly), identifying what tools or integrations the model has access to, and establishing what the expected behavior boundaries are.
This phase is often collaborative. If we’re working with a client’s internal team, we’ll want access to the system prompt, the tool definitions, and the deployment architecture. For black-box engagements, we reconstruct this through behavioral inference, asking the model questions that reveal how it’s been configured, what it knows about itself, and how it responds to edge cases.
Phase 2: Boundary Probing
With context established, we systematically probe the model’s response to inputs that push against its defined constraints. This includes:
Instruction override attempts: Prompts that directly instruct the model to disregard previous instructions or take on a new role. The goal isn’t to check whether the model says “I can’t do that”; it’s to understand whether that refusal is robust under variations in phrasing, framing, and context.
Role reassignment patterns: Inputs that attempt to redefine the model’s identity, authority level, or operational context. This tests whether the model’s adherence to its configured role degrades when challenged persistently or creatively.
Constraint elicitation: Prompts designed to surface the model’s underlying instructions, tool definitions, or business logic, typically by asking it to describe, summarize, or explain aspects of its own configuration.
Context window poisoning patterns: In conversational systems with memory or multi-turn context, testing whether injected content in earlier turns can influence behavior in later turns.
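The single-turn probe categories above can be organized into a small harness. This is a sketch under stated assumptions, not our internal tooling: `ModelFn`, `PROBES`, `looks_like_refusal`, and `stub_model` are hypothetical, the probe strings are placeholders rather than real payloads, and the keyword-based refusal check is a crude stand-in for manual review. Multi-turn context poisoning would need a conversation-aware variant that this sketch does not cover.

```python
from dataclasses import dataclass
from typing import Callable

ModelFn = Callable[[str], str]  # hypothetical: takes a user message, returns the reply


@dataclass
class ProbeResult:
    category: str
    variant: str
    refused: bool


# Placeholder probes only; real test inputs are engagement-specific.
PROBES: dict[str, list[str]] = {
    "instruction_override": ["<override attempt, phrasing A>", "<override attempt, phrasing B>"],
    "role_reassignment": ["<role reframing, phrasing A>"],
    "constraint_elicitation": ["<configuration question, phrasing A>"],
}


def looks_like_refusal(reply: str) -> bool:
    """Crude keyword heuristic; replies should still be reviewed by a human."""
    markers = ("i can't", "i cannot", "outside my scope")
    return any(m in reply.lower() for m in markers)


def run_probes(model: ModelFn, probes: dict[str, list[str]]) -> list[ProbeResult]:
    """Send every variant in every category and record whether the model refused."""
    results = []
    for category, variants in probes.items():
        for variant in variants:
            results.append(ProbeResult(category, variant, looks_like_refusal(model(variant))))
    return results


def stub_model(message: str) -> str:
    """Stand-in for the system under test: refuses everything except phrasing B."""
    return "Sure, here you go." if "phrasing B" in message else "I can't help with that."


results = run_probes(stub_model, PROBES)
for r in results:
    print(f"{r.category:24} refused={r.refused}  {r.variant}")
```

The stub is built so that one rephrased variant gets through while the others are refused, which is the pattern boundary probing is meant to surface: a refusal that holds for one phrasing says little about the next.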
Phase 3: Agent Tool Abuse Evaluation
When the LLM has access to tools, particularly any tool that writes data, makes network requests, or interfaces with external systems, we evaluate whether a successful injection can be chained into an unauthorized action. This is where LLM06 findings emerge.
The evaluation isn’t about demonstrating a full exploit chain. It’s about establishing whether the control boundaries between user intent and tool execution are enforced outside the model’s reasoning, or whether they depend entirely on the model reasoning correctly about what it should and shouldn’t do.
Phase 4: Reporting and OWASP Mapping
Every finding is documented with a clear description of the test input, the observed behavior, the deviation from intended behavior, and the mapped OWASP LLM category. We rate severity based on the combination of exploitability and potential impact given the model’s actual capabilities and access scope. Recommendations are specific to the architecture, not generic.
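A finding record along these lines can be sketched as a small data structure. The field names, the three-level scale, and the severity rule (the lower of exploitability and impact) are assumptions made for illustration, not the rubric we actually use, and the example values are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


@dataclass
class Finding:
    title: str
    test_input_summary: str        # description of the input, not the payload itself
    observed_behavior: str
    intended_behavior: str
    owasp_categories: list[str]
    exploitability: Level
    impact: Level                  # judged against the model's actual tools and data

    @property
    def severity(self) -> Level:
        # Assumed rule for this sketch: severity is capped by the weaker dimension.
        return Level(min(self.exploitability, self.impact))


finding = Finding(
    title="Scope bypass via embedded follow-up",
    test_input_summary="Out-of-scope request framed as a continuation of an in-scope question",
    observed_behavior="Model answered outside its configured scope",
    intended_behavior="Model declines out-of-scope requests",
    owasp_categories=["LLM01:2025"],
    exploitability=Level.HIGH,
    impact=Level.MODERATE,
)
print(finding.severity.name)
```

Keeping impact as its own field makes the dependence on tool access explicit: the same injection pattern scores differently once the deployment gains write access or more sensitive data.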
A Sanitized Attack Scenario
The following is a composite, illustrative scenario based on vulnerability patterns we encounter in real engagements. No client data, real system prompts, or working exploit payloads are included.
Deployment context: A SaaS company deploys an internal HR copilot that employees can ask questions about company policies, benefits, and onboarding procedures. The copilot is backed by a RAG system pulling from a private document store. The system prompt instructs the model to only answer HR-related questions and to decline requests outside that scope.
What a pentester observed: During boundary probing, the tester submitted inputs that reframed the model’s role, not with aggressive or obviously adversarial language, but by embedding the reframing within what appeared to be a continuation of a legitimate HR question. The injected content asked the model to describe what other types of documents it had access to.
Observed behavior: The model partially revealed the document categories indexed in its retrieval system, including category names that suggested the store contained more than HR documentation. It also responded to subsequent follow-up questions outside its intended scope without invoking the refusal behavior that had appeared on earlier direct attempts.
Why it happened: The model’s refusal behavior was conditioned on recognizing out-of-scope requests in isolation. When the request was embedded in a plausible HR context and framed as a follow-up, the refusal trigger didn’t activate.
What the finding mapped to: LLM01:2025 (direct injection via context-embedded instruction override) with a secondary note on LLM07:2025 (system prompt/configuration leakage through behavioral inference). Severity was rated moderate because the document store access in this specific deployment was read-only and the data exposed wasn’t highly sensitive, but the same pattern in a deployment with write access or PII-heavy retrieval would rate significantly higher.
What remediation addressed: The fix involved output filtering for responses that describe internal configuration, stricter scoping of the RAG retrieval corpus, and restructuring the system prompt to explicitly handle follow-up questions as independent scope evaluations rather than inheriting context from earlier in the conversation.
Traditional Pentest vs. LLM Pentest: What Changes
If you’ve had web application or API penetration tests done before, the LLM engagement will feel different in ways that matter for scoping, budgeting, and interpreting results. Here’s where the key differences land.
| Dimension | Traditional Web/API Pentest | LLM Penetration Test |
|---|---|---|
| Vulnerability taxonomy | CVE database, CWE, OWASP Web Top 10 | OWASP LLM Top 10 (2025), custom AI threat models |
| Automated tooling | Scanners (Burp Suite, Nikto, Nessus) do significant heavy lifting | Manual testing dominates; no equivalent scanner for reasoning-layer flaws |
| Attack surface definition | Endpoints, parameters, headers, session tokens | System prompt, context window, tool definitions, retrieval corpus, agent permissions |
| Finding reproducibility | Deterministic — same input produces same output | Probabilistic — LLMs are non-deterministic; findings require multiple confirmation runs |
| Severity calculation | CVSS score based on standard impact/exploitability factors | Severity depends heavily on what tools and data the model can access |
| Fix validation | Re-run the payload, verify the response changed | Requires adversarial re-testing across input variations; patch may not hold under rephrasing |
| Compliance mapping | Maps to PCI DSS, HIPAA, SOC 2 controls for web/network | Maps to OWASP LLM01–LLM10; increasingly referenced in SOC 2 and ISO 27001 AI addenda |
| Time to complete | 5–15 days depending on scope | 5–15 days; agent-heavy systems with complex tool chains take longer |
| Primary tester skill | Network protocols, web stack, exploitation frameworks | LLM architecture, prompt design, adversarial ML, behavioral analysis |
The most consequential difference for clients to understand is the probabilistic nature of LLM behavior. A finding in a web application test is binary: the vulnerability is there or it isn’t. In LLM testing, a model might refuse an injection payload 90% of the time and comply 10% of the time. Both behaviors need to be reported, and the fix needs to address consistency, not just worst-case outputs.
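Because a single run proves little, a compliance rate with an uncertainty range is more useful than a pass/fail verdict. The sketch below assumes a hypothetical `probe` callable that returns `True` when the model complied with the injected instruction; here it is simulated with a seeded random generator rather than a real model. The interval is the standard Wilson score interval for a binomial proportion.

```python
import math
import random
from typing import Callable


def compliance_rate(probe: Callable[[], bool], runs: int) -> tuple[int, float]:
    """Run a probe repeatedly; probe() returns True when the model complied."""
    complied = sum(1 for _ in range(runs) if probe())
    return complied, complied / runs


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Simulated system under test: complies with the payload about 10% of the time.
rng = random.Random(0)
complied, rate = compliance_rate(lambda: rng.random() < 0.10, runs=50)
low, high = wilson_interval(complied, 50)
print(f"complied {complied}/50 ({rate:.0%}), 95% interval {low:.1%} to {high:.1%}")
```

With 5 compliant responses out of 50, for example, the interval runs from roughly 4% to 21%, which is a reminder that a few dozen runs only bound the true rate loosely. The same measurement repeated after a fix shows whether the rate actually dropped or the retest just got lucky.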
What Good Defenses Actually Look Like
Prompt injection isn’t a bug you patch with a line of code. It’s a risk you architect against with multiple independent controls. Here’s what effective defense-in-depth looks like, based on the OWASP LLM Top 10 guidance and the NIST AI Risk Management Framework (NIST AI RMF 1.0) together with its companion 2024 Generative AI Profile (NIST AI 600-1), which addresses risk management practices specific to generative AI.
Least-Privilege Tool Access (Addresses LLM06)
The most reliable mitigation for injection-driven agent abuse is limiting what the model can do in the first place. An LLM that can only read from a single, scoped document corpus has a fundamentally smaller blast radius than one with write access to a CRM, email integration, and code execution capability. Scope tool permissions to the minimum required for the model’s actual use case. Review that scope when the use case changes.
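One way to express that scoping is a per-deployment allowlist that is checked both when tools are exposed to the model and again when a call is dispatched. Everything named here (`CATALOGUE`, `ALLOWED`, `tools_for`, `dispatch`, and the tool names) is hypothetical and only illustrates the shape of the control.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    writes_data: bool
    external_network: bool


# Hypothetical tool catalogue for illustration.
CATALOGUE = {
    "search_hr_docs": Tool("search_hr_docs", writes_data=False, external_network=False),
    "update_crm_record": Tool("update_crm_record", writes_data=True, external_network=False),
    "send_email": Tool("send_email", writes_data=True, external_network=True),
}

# Each deployment lists only the tools its use case needs.
ALLOWED = {"hr_copilot": frozenset({"search_hr_docs"})}


def tools_for(deployment: str) -> list[Tool]:
    """Return only the tools this deployment may expose to the model."""
    allowed = ALLOWED.get(deployment, frozenset())
    return [CATALOGUE[name] for name in sorted(allowed)]


def dispatch(deployment: str, tool_name: str) -> str:
    """Re-check the allowlist at call time, outside the model's reasoning."""
    if tool_name not in ALLOWED.get(deployment, frozenset()):
        raise PermissionError(f"{tool_name!r} is not permitted for {deployment!r}")
    return f"executing {tool_name}"


print([t.name for t in tools_for("hr_copilot")])
print(dispatch("hr_copilot", "search_hr_docs"))
```

The second check matters because an injected model may request a tool it was never shown. An unknown deployment gets an empty tool list by default, so the failure mode is no access rather than full access.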
Privilege Separation Between System and User Context
Architectural controls that treat system prompt content and user input as distinct trust domains, and enforce that distinction outside the model’s reasoning, reduce the attack surface meaningfully. This isn’t always straightforward in LLM deployments, but frameworks and middleware approaches exist that implement this separation at the orchestration layer.
Output Validation and Filtering
Responses that contain internal configuration language, tool schema references, or unusual structural patterns warrant interception before delivery. Output filtering is imperfect because it can be bypassed through rephrasing, but as one layer in a defense-in-depth architecture it reduces the exploitability of successful injections.
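A minimal sketch of such a filter follows. The patterns, the `FALLBACK` message, and `filter_output` are illustrative assumptions; a production filter would be tuned to the specific system prompt and tool schemas it protects.

```python
import re

# Illustrative patterns only; tune these to the deployment being protected.
LEAK_PATTERNS = [
    re.compile(r"\bsystem prompt\b", re.IGNORECASE),
    re.compile(r"\bmy instructions (are|say|state)\b", re.IGNORECASE),
    re.compile(r'"(parameters|tool_name|function)"\s*:'),  # JSON-like tool schema fragments
]

FALLBACK = "Sorry, I can't share that. Is there something else I can help with?"


def filter_output(reply: str) -> tuple[str, bool]:
    """Return (text_to_deliver, was_blocked)."""
    if any(p.search(reply) for p in LEAK_PATTERNS):
        return FALLBACK, True
    return reply, False


print(filter_output("Your balance is $42."))
print(filter_output("My system prompt says to only discuss balances."))
print(filter_output("I was configured to only discuss balances."))
```

The third call shows the limitation in practice: a paraphrased disclosure matches none of the patterns and is delivered unchanged, which is why this control works only as one layer among several.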
Human Approval for High-Impact Actions
For any agent action that writes data, sends communications, makes external API calls, or executes code, requiring explicit human confirmation before execution is the most reliable control against injection-driven abuse. It accepts that the model can be manipulated and compensates at the action layer rather than relying on the model to reason correctly under adversarial conditions.
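The gate itself can be very small. In this sketch, `ProposedAction`, `HIGH_IMPACT`, and `execute_with_gate` are hypothetical names, and the `approve` callback stands in for whatever review step a real deployment uses (a ticket, a confirmation dialog, a second operator).

```python
from dataclasses import dataclass
from typing import Callable

# Action kinds that require a human decision before execution (assumed taxonomy).
HIGH_IMPACT = {"write_data", "send_communication", "external_api_call", "execute_code"}


@dataclass(frozen=True)
class ProposedAction:
    kind: str
    description: str


def execute_with_gate(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
    run: Callable[[ProposedAction], str],
) -> str:
    """Run low-impact actions directly; high-impact ones only after approval."""
    if action.kind in HIGH_IMPACT and not approve(action):
        return "rejected: human approval not granted"
    return run(action)


def run(action: ProposedAction) -> str:
    return f"done: {action.description}"


def deny_all(action: ProposedAction) -> bool:
    """Stand-in reviewer that rejects everything."""
    return False


print(execute_with_gate(ProposedAction("read_data", "look up leave policy"), deny_all, run))
print(execute_with_gate(ProposedAction("write_data", "update payroll record"), deny_all, run))
```

The approval decision is made by code and a person, never by the model, so an injected instruction can at most propose a high-impact action rather than carry it out.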
Adversarial Testing as an Ongoing Practice
Security testing for LLMs isn’t a one-time event. Model behavior can shift across versions; system prompt changes introduce new boundary conditions; new tool integrations expand the attack surface. Periodic adversarial testing, not just at launch, is what the OWASP LLM Top 10 and NIST AI RMF both recommend, and it’s the practice that separates organizations that find injection vulnerabilities from those that have them found by someone else.
Conclusion and Next Steps
Prompt injection is the most prevalent vulnerability class we encounter across AI penetration testing engagements, and it’s the most structurally persistent, because it’s rooted in how language models process input, not in a specific implementation flaw that can be patched and forgotten.
What effective security looks like here isn’t blocking every possible injection payload. That’s not a realistic target. It’s architecting your deployment so that the impact of a successful injection is bounded: limited tool access, action confirmation gates, output filtering, and regular adversarial testing to catch gaps before they matter.
If your organization is running a chatbot, copilot, RAG application, or AI agent in production, or planning to, your AI penetration testing engagement should include explicit prompt injection test cases, agent tool abuse evaluation, and a clear map of findings to the OWASP LLM Top 10 (2025). That’s what we deliver, on a fixed-price basis, for organizations across financial services, e-commerce, and technology.
Ready to understand your actual exposure? Book a 15-minute scoping call with our Team Lead Shofiur Rahman. We’ll scope the engagement, answer your technical questions, and give you a fixed-price quote – no sales process, no obligation.

