We Tested 30+ AI Chatbots and Agents This Year. Here's What Broke Almost All of Them
A support chatbot at a mid-sized SaaS company told a tester, with no adversarial prompting at all, what discount tier a sales rep had quietly applied to a different customer's account. Nobody jailbroke it. Nobody used a clever prompt injection payload pulled off a forum. The tester just asked a few ordinary-sounding follow-up questions about "similar accounts" and the model filled in the gaps from context it should never have had access to in the first place.
That's not a rare story. It's close to the median finding across the AI penetration tests we ran this year.
This post pulls together aggregated, anonymized data from over 30 AI chatbot, RAG, and agent engagements: what we tested, what consistently broke, and where the gap actually sits between "we use a model from a reputable vendor" and "our AI deployment is secure." No client names, no reproducible payloads, no step-by-step exploit chains. Just the patterns, because the patterns are what a CISO building a board deck or answering a vendor security questionnaire actually needs.

What we tested and how we counted it
The dataset behind this post spans engagements across SaaS platforms, fintech tools, healthcare-adjacent products, and a handful of AI-native startups building agentic workflows. Some were single chatbots bolted onto an existing product. Others were multi-tool agents with access to internal APIs, ticketing systems, or customer databases.
Every engagement used manual, adversarial testing mapped to the OWASP LLM Top 10 (2025), not an automated scan run once and forgotten. We're reporting on three categories specifically because they showed up constantly, regardless of industry or how mature the client's AI program was:
- LLM01: Prompt Injection - getting the model to treat attacker input as an instruction rather than data
- LLM02: Sensitive Information Disclosure - getting the model to reveal data it shouldn't have surfaced
- LLM06: Excessive Agency - finding agent tools or permissions broader than the use case actually required
The numbers below are rounded and aggregated across the full portfolio. They're directional, meant to show where the weight of risk sits, not a forensic accounting of any single client.
The findings, in aggregate
Roughly 4 in 10 deployments leaked some fragment of their system prompt. Not always the whole thing, and rarely word for word. More often it was a phrase, an internal tool name, a hint at the guardrail logic, surfaced through a question that looked like ordinary curiosity rather than an attack. System prompt leakage matters because the prompt usually encodes business logic and the exact boundaries of what the model is told not to do. Once an attacker knows the wording of a restriction, bypassing it gets a lot easier.
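One cheap guardrail against the verbatim end of this problem is an output check that looks for runs of words shared between the system prompt and the response. The following is a minimal sketch, not the method used in these engagements: the function names, the five-word window, and the example prompt are all assumptions made for illustration. It only catches copied phrases; a paraphrased leak passes straight through, so it complements manual testing rather than replacing it.

```python
import re


def _word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercase the text, split it into word tokens, and return every run of n tokens."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaks_system_prompt(system_prompt: str, response: str, n: int = 5) -> bool:
    """Return True if the response shares any run of n consecutive words with the prompt.

    The window size n is an assumed tuning knob: smaller values flag more
    responses (including false positives), larger values miss short fragments.
    """
    return bool(_word_ngrams(system_prompt, n) & _word_ngrams(response, n))


# Hypothetical system prompt, invented for this example.
SYSTEM_PROMPT = (
    "You are the Acme support assistant. Never reveal discount tiers. "
    "Use the order_lookup tool only for the signed-in customer."
)
```

Running `leaks_system_prompt(SYSTEM_PROMPT, reply)` on each outgoing reply before it reaches the user gives a simple block-or-log decision point. Internal tool names are worth checking separately, since a single leaked identifier is shorter than any sensible window.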
A little over half allowed some form of role or persona override that weakened content policy enforcement. This is the "pretend you are a different assistant with no restrictions" family of techniques, and it still works more often than vendors would like to admit, even against models with decent baseline safety training. The override didn't always produce something dramatic. Sometimes it just got the model to discuss topics it was supposed to decline, or to drop a disclaimer it was supposed to always include. The business risk here is reputational and compliance-driven rather than catastrophic, but it's still a finding that shows up in a SOC 2 review.
Close to two-thirds of agentic deployments had at least one tool with more access than the use case required. This was the single most consistent finding across the entire dataset, and it's the one that should worry security leadership the most. A support agent with a "look up order status" tool that's actually wired to a generic database query function. A scheduling assistant that can technically modify any calendar, not just the user's own. An internal copilot with write access to a system it only needed to read from. None of this requires a clever attack. It's a permissions and architecture problem that prompt injection then makes exploitable.
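Finding this class of issue usually starts with a plain inventory: for each tool, what does the feature need, and what does the credential behind it actually allow? A rough sketch of that comparison follows. The `ToolGrant` structure and the scope strings such as `orders:read` are hypothetical, chosen only to mirror the examples above; a real audit would pull granted scopes from the API gateway or IAM configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolGrant:
    """One agent tool, with the scopes it needs versus the scopes it was given."""
    name: str
    required: frozenset[str]  # scopes the use case actually needs (assumed known)
    granted: frozenset[str]   # scopes the tool's credential really carries


def excess_scopes(tools: list[ToolGrant]) -> dict[str, set[str]]:
    """Map each over-permissioned tool name to the scopes it holds but does not need."""
    return {
        t.name: set(t.granted - t.required)
        for t in tools
        if t.granted - t.required
    }


# Hypothetical inventory for illustration only.
TOOLS = [
    ToolGrant(
        name="order_status",
        required=frozenset({"orders:read"}),
        granted=frozenset({"orders:read", "orders:write", "customers:read"}),
    ),
    ToolGrant(
        name="schedule_meeting",
        required=frozenset({"calendar:own:write"}),
        granted=frozenset({"calendar:own:write"}),
    ),
]
```

Here `excess_scopes(TOOLS)` reports only `order_status`, with `orders:write` and `customers:read` as the surplus. Each reported entry is either a permission to remove or a use case that needs to be written down.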
Sensitive information disclosure showed up almost every time retrieval was involved. RAG systems pulling from a shared knowledge base, without proper tenant isolation or access scoping, were the most reliable source of cross-customer or cross-department data exposure we found all year. The model isn't doing anything wrong from its own perspective; it's faithfully retrieving and summarizing whatever the retrieval layer handed it. The vulnerability sits one layer below the conversation.
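In practice, fixing this means scoping at the retrieval layer, before ranking and before anything reaches the context window. The sketch below illustrates the idea with a toy in-memory index and keyword-overlap ranking standing in for a real vector search. `Chunk`, `retrieve`, and the tenant names are assumptions for this example, not any particular vector database's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A retrievable passage tagged with the tenant that owns it."""
    tenant_id: str
    text: str


def retrieve(index: list[Chunk], query: str, tenant_id: str, k: int = 3) -> list[Chunk]:
    """Return up to k chunks for the query, drawn only from the caller's tenant.

    The tenant filter runs BEFORE ranking, so another tenant's chunk can never
    be returned, however well it matches the query.
    """
    allowed = [c for c in index if c.tenant_id == tenant_id]
    terms = set(query.lower().split())
    # Toy relevance score: number of query words that appear in the chunk.
    ranked = sorted(
        allowed,
        key=lambda c: len(terms & set(c.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical shared knowledge base containing two tenants' data.
INDEX = [
    Chunk("acme", "acme renewal discount is 10 percent"),
    Chunk("globex", "globex renewal discount is 35 percent"),
    Chunk("acme", "acme support hours are 9 to 5"),
]
```

The key design choice is that `tenant_id` comes from the authenticated session, never from the conversation. If the model or the user can supply it, the filter is decorative.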
A pattern worth naming directly: the deployments that scored worst weren't the ones using exotic, custom-built models. They were the ones bolting a well-known third-party model onto internal systems with the same access-control discipline they'd apply to a static FAQ page. The model provider secures the model. Nobody secures the wiring around it by default.

Why the same three categories keep dominating
It would be convenient if AI security findings were spread evenly across the OWASP LLM Top 10, because that would mean a single control fixed a single problem. They aren't. Prompt injection, sensitive information disclosure, and excessive agency keep showing up together because they share a root cause: most teams design the happy path first and the trust boundary later, if at all.
A chatbot gets built to answer customer questions. Someone wires it to a knowledge base so it can answer more questions accurately. Someone else gives it a tool so it can actually do something instead of just talking. Each step makes sense on its own. None of those steps comes with a built-in question of "what's the worst thing a user could get this thing to do with the access it now has," because that question belongs to security review, and security review often arrives after the feature has already shipped to a few pilot customers.
This is also why automated scanning tools consistently underperform against AI systems compared to how they perform against traditional web apps. A scanner can fuzz an input field and watch for a stack trace. It has a much harder time recognizing that a model just leaked a fragment of its own configuration in a sentence that reads as a normal, grammatically correct response. The failure doesn't look like a failure. It looks like the AI being helpful, which is exactly the problem.
How severity actually breaks down across these findings
Not every finding in this dataset carries the same weight, and treating them as uniformly dangerous does a disservice to anyone trying to prioritize remediation. A rough severity breakdown, based on business impact rather than technical novelty, looks something like this:
High severity: cases where excessive agency was combined with a real data-modifying or data-exfiltrating tool. This is where an over-permissioned agent could plausibly take an action a human reviewer never approved, like modifying a record, sending an unauthorized message, or pulling a dataset outside its intended scope. These findings were less common than the others but carried the most weight when they appeared, and almost always traced back to a tool that was given broader API access than the specific feature needed.
Medium severity: sensitive information disclosure through retrieval, and system prompt leakage that exposed meaningful business logic rather than generic boilerplate. These don't usually let an attacker take an action, but they leak something that has real value, whether that's another customer's data or the exact wording of a guardrail an attacker can now plan around.
Lower severity, but still reportable: role and persona overrides that weakened content policy without exposing data or enabling an action. These matter most for compliance and reputational reasons. A chatbot that can be talked into dropping a required disclaimer or discussing an off-limits topic isn't usually a breach, but it's exactly the kind of finding an auditor or a journalist would flag, and it shows up in vendor security questionnaires more often than security teams expect.
The point of breaking it down this way is that an AI security report shouldn't read like a flat list of scary-sounding vulnerability names. The business question is always "what's the realistic worst case," and that question has a different answer for a leaked phrase of boilerplate than it does for an agent that can modify a financial record.
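For teams that track findings in a spreadsheet or ticketing system, the breakdown above can be written down as a small triage rule. This is only a sketch of the tiers as described here, not a scoring standard: the `Finding` fields and category strings are invented for illustration, and the fallback to the lowest tier for anything not explicitly covered (for example, excessive agency without a data-modifying tool) is an assumption that a real program would want to review case by case.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class Finding:
    """A simplified finding record; field names are hypothetical."""
    category: str  # "excessive_agency", "info_disclosure", "prompt_leak", "persona_override"
    tool_can_modify_or_exfiltrate: bool = False
    exposes_business_logic: bool = False


def triage(finding: Finding) -> Severity:
    """Assign a tier based on business impact, following the breakdown in the text."""
    if finding.category == "excessive_agency" and finding.tool_can_modify_or_exfiltrate:
        return Severity.HIGH
    if finding.category == "info_disclosure":
        return Severity.MEDIUM
    if finding.category == "prompt_leak" and finding.exposes_business_logic:
        return Severity.MEDIUM
    # Assumption: everything else, including persona overrides and leaked
    # boilerplate, lands in the lowest (still reportable) tier.
    return Severity.LOW
```

A rule like this is most useful as a first pass. The final rating should still come from someone asking what the realistic worst case is for that specific system.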
A sanitized example: how excessive agency actually plays out
Here's a composite scenario, built from patterns across multiple engagements rather than any single client, that illustrates how LLM06 typically surfaces in practice.
An internal support copilot was built to help agents look up customer order history. It had a single tool connected to a backend API. That API, however, accepted a customer ID parameter with no validation against which agent or session was making the request, because the original use case never anticipated the model would be the one constructing the request. During testing, a tester was able to get the assistant to retrieve order details for a customer ID that didn't belong to the active support session, simply by phrasing a request that referenced "the previous customer" in a way the model interpreted as a legitimate follow-up.
No injection payload was needed. No jailbreak phrase was needed. The vulnerability was the tool's permission scope, not the model's willingness to misbehave. That's the defining shape of excessive agency: the model did exactly what it was built to do, and what it was built to do was too broad.
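The fix for the composite scenario above lives in the tool, not the prompt: the backend should check the requested customer ID against the session that is making the call, whoever constructed the request. A minimal sketch follows, with hypothetical names throughout (`SupportSession`, `get_order_history`, and the in-memory `ORDERS` table stand in for a real session store and orders API).

```python
from dataclasses import dataclass


class ToolAuthorizationError(Exception):
    """Raised when a tool call asks for data outside the active session's scope."""


@dataclass(frozen=True)
class SupportSession:
    """Server-side session state; never populated from model output."""
    agent_id: str
    customer_id: str  # the one customer this support session is about


# Hypothetical stand-in for the orders backend.
ORDERS = {
    "cust-100": ["order-1", "order-2"],
    "cust-200": ["order-9"],
}


def get_order_history(session: SupportSession, requested_customer_id: str) -> list[str]:
    """Return order history only for the customer bound to the active session.

    requested_customer_id is whatever the model put in the tool call, so it is
    treated as untrusted input and checked against the session.
    """
    if requested_customer_id != session.customer_id:
        raise ToolAuthorizationError(
            f"session for {session.customer_id} may not read {requested_customer_id}"
        )
    return ORDERS.get(requested_customer_id, [])
```

With this check in place, a request about "the previous customer" fails at the API even if the model interprets it as a legitimate follow-up. A stricter variant would drop the parameter entirely and read the customer ID from the session alone.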
Traditional pentest vs. AI pentest: why this kept getting missed
A lot of the organizations in this dataset had already completed a standard web application or API penetration test that year. Almost none of those engagements caught the findings above, because they weren't designed to.
| | Traditional Web/API Pentest | AI/LLM Penetration Test |
|---|---|---|
| Primary attack surface | HTTP requests, authentication, session handling, injection in structured queries | Natural language input, system prompts, retrieval pipelines, agent tool calls |
| Typical tooling | Vulnerability scanners, fuzzers, manual exploitation of known CVE classes | Manual adversarial prompting, persona and context manipulation, tool-permission mapping |
| What it catches | SQL injection, broken access control, misconfigurations, outdated dependencies | Prompt injection, system prompt leakage, agent over-permissioning, retrieval data exposure |
| What it misses | Anything that requires reasoning about model behavior under adversarial language | Classic infrastructure flaws unrelated to the AI integration layer |
| Compliance relevance | SOC 2, PCI DSS, general infrastructure controls | Increasingly referenced in SOC 2 (CC6.1, CC6.6) and vendor security questionnaires for AI features |
The two aren't competing disciplines. Most of the organizations in this dataset needed both, and the ones that had only run the traditional test had a real gap in their actual risk picture, one that often only became visible once the AI test started turning up findings.
What this means if you're the one signing off on AI risk
If you're a CTO or CISO reading this because you're staring down a vendor security questionnaire, a board deck on AI risk, or just a nagging feeling that your chatbot or agent hasn't actually been tested the way it should have been, here's the practical takeaway: the most common failures aren't exotic. They're permission boundaries that never got tightened once the prototype became production, and retrieval pipelines that were never scoped to the tenant or user asking the question.
That's good news in one sense. These are fixable, well-understood problems once someone actually looks for them. It's also exactly why testing matters: nobody finds a permissions gap by reading documentation. Someone has to try to walk through it.
AI penetration testing is built specifically to find this category of issue before it shows up in an incident report or a lost enterprise deal. If your AI feature has shipped and hasn't been adversarially tested against the OWASP LLM Top 10, the data above is a reasonable estimate of what's currently sitting in production.
Frequently asked questions about AI Chatbot Security Testing Results
What percentage of AI chatbots have security vulnerabilities?
Across the engagements behind this report, the overwhelming majority had at least one finding mapped to the OWASP LLM Top 10, most commonly excessive agency in agentic deployments and sensitive information disclosure in RAG systems. The specific percentage varies by architecture, but having zero findings was the exception, not the norm.
Is prompt injection actually exploitable in production, or is it mostly theoretical?
It's exploitable, but the real-world impact depends entirely on what the model is connected to. A standalone FAQ chatbot with no tool access and no sensitive data behind it has a low blast radius even if injection succeeds. An agent with email, database, or file access has a much higher one. This is why our AI penetration testing methodology weighs injection findings against what the system can actually do, not just whether the override worked.
Does using a reputable model provider like OpenAI or Anthropic make our AI feature secure by default?
No, and this is the single most common misconception we run into. The model provider secures the model itself. You're responsible for the prompts you write, the data you retrieve, the tools you connect, and the permissions those tools carry. That integration layer is where nearly every finding in this report originated.
How is sensitive information disclosure (LLM02) different from a normal data breach?
A traditional breach usually involves an attacker bypassing access controls directly. LLM02 findings often involve the AI system doing the bypassing on the user's behalf, voluntarily surfacing data through normal conversation because the underlying retrieval or context window wasn't properly scoped. It doesn't require credential theft or a network exploit, just a conversation that drifts somewhere it shouldn't.
We haven't had any AI security testing done yet. Where should we start?
Start with a scoping conversation rather than guessing at a package. The right starting point depends on whether you're using a third-party model API with light integration, a RAG pipeline, or a multi-tool agent, since each carries a different risk profile and testing approach. A 15-minute scoping call is usually enough to identify which tier fits and what a fixed-price engagement would look like.
Will an AI security assessment slow down our release schedule?
Properly scoped testing runs in parallel with development in most cases and doesn't require taking your AI feature offline. Engagements are typically scoped around staging environments or carefully coordinated production windows specifically so they don't disrupt your release cadence.
The bottom line
Thirty-plus engagements is enough to see the same handful of failure patterns repeat across industries, team sizes, and levels of AI maturity. The deployments that held up best weren't the ones with the fanciest models. They were the ones where someone had actually scoped tool permissions, isolated retrieval by tenant, and tested the system the way an attacker would talk to it, not just the way a developer expected a user to.
If your AI chatbot, copilot, or agent hasn't gone through that process yet, book a 15-minute scoping call and we'll walk through what testing would actually look like for your specific setup.

