Your RAG-Based Chatbot Passed a Pentest. Is It Actually Secure?

A SaaS company came to us after passing a penetration test on their customer support chatbot. Clean report, no criticals, box checked for their SOC 2 auditor. The chatbot ran on a retrieval-augmented generation (RAG) pipeline pulling from a shared vector database across all their customers. Nobody had tested whether one tenant's chatbot session could retrieve chunks of another tenant's documents. It could. The original pentest never queried the vector store directly, because it wasn't scoped to look. This is the pattern we see most often when we're asked to review a RAG system after a "passed" security test: the report is accurate, and it's also answering the wrong question. RAG chatbot security testing needs to examine the retrieval layer, not just the conversation layer, and that distinction is where most existing pentests fall short.

A clean pentest report and a secure RAG system are not the same thing

Most penetration tests scoped for "the chatbot" test it the way testers test any web application: authentication, session handling, input validation, maybe a few prompt injection attempts typed into the chat window. All of that is legitimate work, and none of it touches the part of a RAG application that sets it apart from an ordinary web app: the retrieval layer.

A RAG chatbot isn't one system. It's a pipeline. User input goes through an embedding model, gets converted into a vector, gets matched against a vector database, pulls back the nearest chunks of stored content, and stuffs those chunks into the prompt before the language model ever generates a response. A tester who only interacts with the chat interface never sees most of that pipeline. They see input and output. They don't see what got retrieved, from where, or whether the retrieval respected any access boundaries at all.
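
A minimal sketch of that pipeline, assuming a toy bag-of-words `embed` function in place of a real embedding model and an in-memory list in place of a vector database. Every name here (`embed`, `retrieve`, `build_prompt`, the sample corpus) is hypothetical. The only point is that the retrieved chunks are chosen and inserted into the prompt before any model is called, and the chat user never sees that step.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a word-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical corpus; a real system would hold these in a vector database.
CORPUS = [
    "reset your password from the account settings page",
    "invoices are emailed on the first day of each month",
    "internal only: escalation contacts for enterprise outages",
]
INDEX = [(embed(chunk), chunk) for chunk in CORPUS]


def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k stored chunks closest to the query vector."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(query: str) -> str:
    """Assemble the augmented prompt that would be handed to the LLM."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"


print(build_prompt("how do I reset my password"))
```

Note that `retrieve` always returns `k` chunks, even when some of them barely match the query. Whatever lands in that list becomes context for the model.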

That gap matters because two OWASP Top 10 for LLM Applications (2025) categories sit almost entirely inside it: LLM08 (Vector and Embedding Weaknesses) and LLM02 (Sensitive Information Disclosure). Neither shows up reliably in a pentest that treats the chatbot as a black box with a text input field.


The RAG attack surface a chat-only test never reaches

A production RAG system has roughly six components, and each one is a place where something can go wrong before the model ever generates a word:

  • The corpus - the source documents, tickets, wikis, or knowledge base entries that get ingested
  • The embedding pipeline - the model and process that turns text into vectors
  • The vector store - where those vectors live, indexed for similarity search (Pinecone, Weaviate, pgvector, and similar)
  • The retriever - the query logic that decides which vectors are "close enough" to return
  • The reranker, if present - a secondary pass that reorders retrieved results before they reach the prompt
  • The augmented prompt - the final assembly of retrieved context plus user input, handed to the LLM

Every one of those is a candidate for testing in its own right. A chat-only pentest interacts with none of them directly. It sends messages and reads replies. It has no visibility into which documents got retrieved for a given query, whether a metadata filter actually scoped the search to the right tenant or user, or whether the corpus itself contains anything that shouldn't be retrievable by a general user in the first place.

RAG penetration testing has to include direct interaction with the retrieval layer: querying the vector store's API where reachable, testing whether access-control metadata is enforced at query time rather than assumed, and checking whether sensitive material was scrubbed from the corpus before it was embedded. None of that happens by typing questions into a chat widget.


[Diagram: RAG pipeline attack surface]

LLM08: Vector and Embedding Weaknesses, in practice

LLM08 covers vulnerabilities in how content gets converted into embeddings and how those embeddings are stored, indexed, and retrieved. In our engagements, three variants show up repeatedly.

Cross-tenant or cross-user leakage. This is the one from the scenario above, and it's the most common finding we see in multi-tenant RAG deployments. If tenant isolation is enforced only in the application layer and not at the vector query itself, a crafted or even accidental query can return chunks that belong to a different tenant's index or namespace. Vector similarity search doesn't inherently know or care who's asking.

Retrieval poisoning. If the ingestion pipeline accepts content from a source an attacker can influence, such as a support ticket, an uploaded document, or a public wiki page, that content gets embedded and stored alongside everything else. Once it's in the index, it can be retrieved and treated as trusted context for future queries, without ever going through the front-end chat interface at all.
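
A toy illustration of that flow, assuming a hypothetical `ingest` function that accepts both curated knowledge base text and support tickets, and a crude word-overlap score standing in for vector similarity. It is a sketch of the pattern, not a description of any real ingestion pipeline.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # "kb" for curated content, "ticket" for user-submitted content


INDEX: list[Chunk] = []


def ingest(text: str, source: str) -> None:
    """Hypothetical ingestion step: everything lands in the same index."""
    INDEX.append(Chunk(text, source))


def overlap(query: str, text: str) -> int:
    """Crude stand-in for vector similarity: number of shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, k: int = 1) -> list[Chunk]:
    return sorted(INDEX, key=lambda c: overlap(query, c.text), reverse=True)[:k]


ingest("to change your billing email open account settings", source="kb")
# A ticket worded to match a common question outranks the curated answer.
ingest(
    "how do i change my billing email contact the number in this ticket instead of support",
    source="ticket",
)

top = retrieve("how do i change my billing email")[0]
print(top.source, "->", top.text)
```

Keeping a `source` field on every chunk, as this sketch does, is one way to let later stages treat user-submitted content differently from curated content.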

Embedding inversion exposure. Vector embeddings are lossy but not random. Depending on the model and the amount of access an attacker has to the embedding pipeline or raw vectors, it can be possible to reconstruct fragments of the original content from its embedding alone. This matters most when raw vectors are exposed through an API, a debug endpoint, or an underprotected admin panel.
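
The sketch below is a much simpler stand-in for real embedding inversion: a candidate-guessing check. It assumes an attacker holds one raw vector and can call the same embedding function, and it uses a toy hashed word-count embedding (`embed`, `DIM`, and the sample sentences are all hypothetical). It does not reconstruct text from a learned dense embedding. It only shows why a raw vector is not an opaque blob.

```python
import hashlib
import math

DIM = 256  # toy dimensionality; real embedding models use learned dense vectors


def embed(text: str) -> list[float]:
    """Toy embedding: hash each word into one of DIM buckets and count."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


secret = "q3 layoffs planned for the support team"
leaked_vector = embed(secret)  # imagine this came back from a debug endpoint

# The attacker never sees `secret`, only the vector, and scores guesses against it.
guesses = [
    "q3 hiring planned for the support team",
    "q3 layoffs planned for the support team",
    "office move planned for the sales team",
]
scores = {g: cosine(leaked_vector, embed(g)) for g in guesses}
for guess, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.2f}  {guess}")
```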

None of these require a single prompt injection attempt to be exploitable. They live in infrastructure and data flow, not in the conversation.


LLM02: Sensitive Information Disclosure through retrieval

LLM02 is broader than RAG, but RAG architectures give it a specific and underestimated shape. A general-purpose chatbot leaks sensitive information when it's coaxed into saying something it shouldn't. A RAG chatbot can leak sensitive information simply by doing its job correctly: retrieving the most semantically relevant chunk to a query, even when that chunk was never meant to be exposed to the person asking.

Semantic search doesn't require exact keyword matches. If a sensitive document (a salary spreadsheet, an incident report, an internal policy memo) ends up embedded in the same index the chatbot draws from, a sufficiently well-phrased question can retrieve it even without anyone trying to attack the system in a conventional sense. This is why corpus hygiene (what gets embedded, and with what access controls) is as much a part of RAG security as anything happening in the model.


A sanitized example: what this looks like in an engagement

To be clear about the boundaries here: nothing below is a working exploit, a real client detail, or a reproducible attack chain. It's a pattern we've observed across multiple engagements, described generically.

A B2B SaaS platform ran a single shared vector index across all customer accounts, with tenant separation intended to happen through a metadata field attached to each vector at ingestion time. The application layer filtered results by that metadata field when displaying them to users. During testing, we found that the vector database's own query API accepted similarity searches without requiring that metadata filter to be present, meaning a query issued slightly differently than the application normally issued it could return unfiltered results across tenant boundaries. The chatbot's front end never exposed this, because it always applied the filter correctly. The underlying data layer didn't enforce it independently.

The fix wasn't complicated: enforce tenant scoping as a mandatory, non-optional parameter at the vector store level, not just in application code that calls it. But nobody found the gap until someone tested the retrieval layer directly instead of only the chat interface sitting on top of it. That's the difference RAG-aware testing makes.
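
One way to picture that fix is a data layer with no unscoped query path at all. The class below is a hypothetical in-memory sketch, not any vendor's API: `TenantPartitionedStore`, its methods, and the word-overlap ranking are assumptions made for illustration. The tenant identifier is a required keyword argument on every read and write, so a caller cannot omit it by accident.

```python
class TenantPartitionedStore:
    """Toy data layer: every read and write is keyed by tenant."""

    def __init__(self) -> None:
        self._partitions: dict[str, list[str]] = {}

    @staticmethod
    def _require(tenant_id: str) -> None:
        # Reject empty or non-string tenant ids instead of falling back to "all".
        if not isinstance(tenant_id, str) or not tenant_id:
            raise ValueError("tenant_id is mandatory")

    def add(self, text: str, *, tenant_id: str) -> None:
        self._require(tenant_id)
        self._partitions.setdefault(tenant_id, []).append(text)

    def search(self, query: str, *, tenant_id: str, k: int = 3) -> list[str]:
        self._require(tenant_id)
        words = set(query.lower().split())
        rows = self._partitions.get(tenant_id, [])  # only this tenant's rows
        ranked = sorted(rows, key=lambda t: len(words & set(t.lower().split())), reverse=True)
        return ranked[:k]


store = TenantPartitionedStore()
store.add("acme refund policy 30 days", tenant_id="acme")
store.add("globex refund policy 90 days", tenant_id="globex")

print(store.search("refund policy", tenant_id="acme"))
```

Calling `store.search("refund policy")` without `tenant_id` raises a `TypeError` because the argument has no default, which is the behavior the application-layer filter in the example above could not guarantee.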


How the retrieval layer actually gets tested

Testing the retrieval layer directly means the assessment doesn't stop at the chat window. In practice, that involves a few distinct activities that a chat-only pentest never performs:

Direct interaction with the vector store's query interface, where it's reachable, to check whether access-control metadata is enforced independently of the application code that's supposed to apply it. If the only thing stopping a cross-tenant query is a filter the front end happens to add, that's not enforcement; it's a convention an attacker doesn't have to follow.
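
A check like this can be expressed as a small harness. The sketch below assumes the tester can wrap whatever direct query path is reachable in a function that takes a query string and returns records carrying a `tenant` tag. `cross_tenant_hits` and `unfiltered_search` are hypothetical names, and the stub data stands in for a real store.

```python
from typing import Callable, Iterable

SearchFn = Callable[[str], list[dict]]


def cross_tenant_hits(search: SearchFn, probes: Iterable[str], own_tenant: str) -> list[dict]:
    """Return every result whose tenant tag differs from the tester's own tenant."""
    hits = []
    for probe in probes:
        for record in search(probe):
            if record.get("tenant") != own_tenant:
                hits.append({"probe": probe, **record})
    return hits


def unfiltered_search(query: str) -> list[dict]:
    """Stub for a direct call to the store's query API with no tenant filter."""
    data = [
        {"tenant": "tenant-a", "text": "tenant-a onboarding guide"},
        {"tenant": "tenant-b", "text": "tenant-b onboarding guide"},
    ]
    words = set(query.lower().split())
    return [r for r in data if words & set(r["text"].split())]


violations = cross_tenant_hits(unfiltered_search, ["onboarding guide"], own_tenant="tenant-a")
for v in violations:
    print(f"probe {v['probe']!r} returned a record tagged {v['tenant']}")
```

An empty `violations` list for a given set of probes is evidence, not proof, of isolation; the result is only as good as the probes and the query paths covered.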

Corpus review, which means looking at what's actually been ingested into the index, not just what the chatbot is designed to surface. It's common to find that a corpus was built from an entire document repository or wiki export, including material nobody intended to make queryable, long before anyone thought about who might ask the right question to retrieve it.
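
A first pass at corpus review can be as simple as pattern matching over the ingested chunks. The patterns and sample chunks below are illustrative assumptions only; a real review would tune them to the organization and would not rely on regular expressions alone.

```python
import re

# Illustrative patterns only, not a complete or recommended rule set.
PATTERNS = {
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "restricted_label": re.compile(r"\b(confidential|internal only|salary)\b", re.IGNORECASE),
}


def flag_chunks(chunks: list[str]) -> list[tuple[int, str]]:
    """Return (chunk index, pattern name) for every pattern that matches a chunk."""
    flags = []
    for i, chunk in enumerate(chunks):
        for name, pattern in PATTERNS.items():
            if pattern.search(chunk):
                flags.append((i, name))
    return flags


# Hypothetical chunks as they might come out of a wiki export.
chunks = [
    "How to reset your password from the settings page.",
    "CONFIDENTIAL: salary bands for 2024, contact hr@example.com",
    "Employee record 123-45-6789",
]

for index, name in flag_chunks(chunks):
    print(f"chunk {index}: {name}")
```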

Adversarial retrieval testing, which probes whether carefully phrased but ordinary-looking queries can pull back content that a keyword search never would have surfaced. Semantic similarity doesn't respect the boundaries a human reviewer would draw around "things this chatbot should talk about."

Ingestion-path testing, for pipelines that accept content from semi-trusted sources such as support tickets, uploaded files, or user-submitted feedback, to see whether that content can influence what gets embedded and later retrieved as trusted context for other users.

None of this requires exotic tooling. It requires treating the vector database as a system that needs its own access-control and data-exposure review, the same way a traditional pentest treats a SQL database, rather than a black box behind the chat interface.


Traditional pentest vs. RAG-aware AI pentest

| Dimension | Traditional web/API pentest | RAG-aware AI pentest |
| --- | --- | --- |
| Primary attack surface | HTTP endpoints, auth, session handling, input validation | Corpus, embedding pipeline, vector store, retriever, augmented prompt, plus the app layer |
| Typical tooling | Burp Suite, network and web scanners, manual API testing | Direct vector store queries, retrieval analysis, manual adversarial prompting, OWASP LLM Top 10 methodology |
| Access-control testing | Tests roles and permissions in the application | Tests whether access control is enforced at the data layer, not just assumed from the app |
| Data exposure focus | SQL injection, IDOR, broken object-level access | Cross-tenant retrieval, embedding inversion, corpus-level sensitive data exposure |
| Findings this approach misses | Retrieval poisoning, tenant isolation in the vector index, semantic-search data leakage | Classic infrastructure flaws unrelated to the AI pipeline (still needed alongside AI testing) |
| Compliance mapping | General web app security controls | Direct mapping to OWASP LLM Top 10 (2025), referenced increasingly in SOC 2 and NIST AI RMF reviews |

Neither approach replaces the other. A RAG system still runs on infrastructure that needs conventional testing. The point is that a passed traditional pentest tells you almost nothing about whether LLM08 and LLM02 risks exist in your retrieval pipeline, because it was never built to look there.


What a RAG-aware assessment actually checks

A methodology built around the OWASP LLM Top 10 (2025) treats the vector layer as first-class scope, not an afterthought. In practice that means:

  • Reviewing how tenant or user isolation is enforced at the vector query level, not just in the UI
  • Testing whether the ingestion pipeline can be influenced by untrusted or semi-trusted sources
  • Assessing whether raw embeddings or vector metadata are exposed through any reachable API
  • Checking what's actually inside the corpus and whether sensitive material was embedded without corresponding access restrictions
  • Mapping every finding back to specific OWASP LLM Top 10 category IDs so the results are auditable, not just a narrative report
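
As a sketch of what mapping findings to category IDs can look like in practice, the snippet below groups findings by the two categories named in this article. The `Finding` structure, the severity labels, and the sample findings are hypothetical; only the category IDs and names come from the text above.

```python
from collections import defaultdict
from dataclasses import dataclass

# Category IDs and names as cited in this article (OWASP Top 10 for LLM Applications, 2025).
CATEGORIES = {
    "LLM02": "Sensitive Information Disclosure",
    "LLM08": "Vector and Embedding Weaknesses",
}


@dataclass(frozen=True)
class Finding:
    title: str
    category_id: str
    severity: str  # hypothetical scale, e.g. "high", "medium", "low"

    def __post_init__(self) -> None:
        # Refuse findings that are not tied to a known category ID.
        if self.category_id not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category_id}")


def by_category(findings: list[Finding]) -> dict[str, list[str]]:
    """Group finding titles under their category ID."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for finding in findings:
        grouped[finding.category_id].append(finding.title)
    return dict(grouped)


findings = [
    Finding("Vector query API accepts searches without tenant filter", "LLM08", "high"),
    Finding("HR documents embedded in customer-facing index", "LLM02", "high"),
    Finding("Support tickets ingested without provenance tagging", "LLM08", "medium"),
]

for category_id, titles in by_category(findings).items():
    print(f"{category_id} ({CATEGORIES[category_id]}): {len(titles)} finding(s)")
```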

This is the same kind of structured, manual-first approach we bring to AI penetration testing engagements generally, applied specifically to the retrieval architecture that makes a RAG system a RAG system.


What this costs

Pricing depends on how deep the RAG integration goes and how much of your stack is in scope. Most engagements land in one of three tiers.

| Tier | Price range | Best fit |
| --- | --- | --- |
| Starter | From $9,500 | Third-party model APIs with limited backend integration; first AI feature shipping |
| Professional | $15,000–$35,000 | Active RAG pipelines, plugins, and internal tool integrations; where most production RAG chatbots belong |
| Enterprise | $35,000–$75,000 | Proprietary models, complex agentic systems, full adversarial and infrastructure review |

A RAG chatbot with a shared vector store and live customer data almost always sits in the Professional tier or above, because the retrieval layer itself needs dedicated testing time beyond a standard conversational assessment. Exact scope and pricing get confirmed on a short call, not guessed at from a blog post.


Why this matters for compliance, not just security

Auditors are starting to ask about this directly. SOC 2 logical access criteria and NIST's AI Risk Management Framework (AI RMF) are both being applied to AI systems that handle customer data, and a RAG chatbot pulling from a shared knowledge base is exactly the kind of system that raises questions during a vendor security review. Being able to show that your retrieval layer, not just your chat interface, was tested against a named framework is a materially different answer to give a customer's security team or an auditor. The OWASP Top 10 for LLM Applications is increasingly the reference framework auditors expect for technical findings, and the NIST AI RMF provides the governance layer most compliance programs are building around.

For a RAG system specifically, this means an auditor asking about logical access controls isn't just asking whether your application enforces roles correctly. They're increasingly asking whether the data your model retrieves is scoped the same way, and "we haven't checked the vector layer specifically" is a harder answer to give with each audit cycle.

Frequently asked questions about RAG Chatbot Security Testing

Does a standard penetration test cover RAG-specific risks?

Generally, no. A standard web or API pentest tests the chatbot's interface, authentication, and session handling. It doesn't typically query the vector database directly or assess whether tenant isolation is enforced at the data layer, which is where most RAG-specific vulnerabilities under LLM08 and LLM02 actually live.

What's the difference between LLM08 and LLM02 in a RAG context?

LLM08 (Vector and Embedding Weaknesses) covers flaws in how content is embedded, stored, and retrieved: things like poisoned indexes or cross-tenant leakage. LLM02 (Sensitive Information Disclosure) covers the outcome: sensitive data ending up in a response. In RAG systems the two frequently overlap, since a vector or embedding flaw is often the mechanism that produces the disclosure.

Can our RAG chatbot be tested without disrupting production?

Yes. Testing is scoped during a kickoff call, and where production testing carries risk, we work against staging or coordinate carefully with your team. This is standard practice for any AI penetration testing engagement, not just RAG-specific work.

We use a managed vector database provider. Doesn't that handle security for us?

The provider secures the infrastructure the database runs on. Your team is responsible for how you structure indexes, enforce (or fail to enforce) tenant isolation, and decide what gets embedded in the first place. Those decisions live in your architecture, not the vendor's.

How long does a RAG-focused AI security assessment take?

It depends on the number of indexes, the complexity of your retrieval logic, and whether agentic behavior is layered on top of the RAG pipeline. A single chatbot with one vector store moves faster than a multi-tenant platform with several knowledge bases feeding one model.

Do you need access to our vector database to test it properly?

Some level of access, or at minimum a documented understanding of your retrieval architecture, is required to test LLM08 risks meaningfully. This gets defined during scoping, along with rules of engagement that keep testing safe and contained.

What do we get at the end of a RAG chatbot security assessment?

A prioritized report mapped to OWASP LLM Top 10 categories, with reproduction steps, business impact analysis, and specific remediation guidance for each finding, plus a free retest once fixes are applied.


Get your RAG pipeline tested, not just your chatbot

A passed pentest is a real result, and it's not the same result as a tested retrieval layer. If your chatbot runs on RAG and you haven't specifically verified how vector search handles tenant isolation, retrieval poisoning, or embedding exposure, that's an open question your next security review or customer questionnaire is going to ask eventually. Better to have the answer before they do. Our AI penetration testing service is built around exactly this gap: retrieval-layer coverage most existing pentests never scope in.

Book a 30-minute scoping call and we'll tell you honestly whether your RAG architecture needs a dedicated assessment or whether your existing testing already covers it.
