A B2B SaaS company with roughly 400 employees had product knowledge scattered across Confluence spaces, Google Drive folders, and Zendesk macros. Support engineers and account executives routinely spent more than twenty minutes per complex question, often answering from outdated PDF exports that no one had time to refresh.
The leadership team wanted a single internal assistant with citations, not another chatbot that improvises policy answers. We took a RAG knowledge assistant to a production pilot in eight weeks and hardened operations through week twelve. The build combined a retrieval-first architecture, department-scoped indexes, hybrid search with reranking, and a golden evaluation set of 180 questions owned by sales enablement.
Client context and constraints
The client operates in workflow automation software for mid-market operations teams. Their buyers care about implementation risk, which means internal answers must match the current product version and legal wording.
EU data residency was mandatory: vectors, inference, and logs had to stay inside the client VPC contract region. Legal blocked fine-tuning on internal policy text in phase one — every answer about HR, security, or commercial terms needed a visible source link.
Success was defined operationally: median time-to-answer under ninety seconds for tier-one internal queries, with a clear human escalation path when retrieval confidence was low.
Discovery and corpus design
Week one focused on inventory, not models. We mapped which Confluence spaces were authoritative, which Drive folders were abandoned, and which Zendesk macros duplicated policy text. Roughly twelve percent of content was excluded as deprecated or conflicting.
We introduced a metadata schema that every chunk carries: product line, document version, locale, and access group synced from Azure AD. Without that schema, multi-tenant-style isolation inside one company would have failed at the first reorg.
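A minimal sketch of what such a per-chunk schema and an access filter might look like. The class names, field names, and group names are illustrative assumptions, not the client's actual schema; in practice the filter would more likely run as a query predicate in the vector store than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkMetadata:
    """Metadata carried by every indexed chunk (field names are hypothetical)."""
    product_line: str
    document_version: str
    locale: str
    access_groups: frozenset[str]  # group names synced from the identity provider


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    meta: ChunkMetadata


def visible_chunks(chunks: list[Chunk], user_groups: set[str], locale: str) -> list[Chunk]:
    """Keep only chunks in the requested locale that share a group with the user."""
    return [
        c for c in chunks
        if c.meta.locale == locale and c.meta.access_groups & user_groups
    ]


# Hypothetical sample data for illustration only.
chunks = [
    Chunk("c1", "Refund policy text", ChunkMetadata("core", "v3", "en", frozenset({"sales", "support"}))),
    Chunk("c2", "Salary band text", ChunkMetadata("hr", "v1", "en", frozenset({"hr"}))),
]
allowed = visible_chunks(chunks, user_groups={"support"}, locale="en")
print([c.chunk_id for c in allowed])
```

Filtering on metadata before retrieval, rather than after generation, means a reorg only requires updating group membership, not re-indexing content.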
Enablement teams drafted the first eighty golden questions from real Slack threads — anonymized — plus forty “trap” questions designed to catch stale or ambiguous policy wording.
Solution architecture
- Ingestion: nightly Confluence and Drive sync, OCR for scanned PDFs, PII scrub on HR documents before indexing.
- Chunking: parent-child on policy PDFs — eight-hundred-token children for search, full section parent returned to the model.
- Storage: pgvector inside the client VPC with separate namespaces per department.
- Retrieval: hybrid BM25 plus dense vectors, Cohere rerank on the top thirty candidates.
- Generation: GPT-4-class model with a strict citation JSON template and a hard refusal when the retrieval score falls below 0.42.
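The retrieval and generation steps in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: reciprocal rank fusion is one common way to merge BM25 and dense rankings and is assumed here, since the case study does not say how the two were combined. The reranker is not called; `top_score` stands in for whatever score it returns, and the JSON field names are hypothetical.

```python
import json

REFUSAL_THRESHOLD = 0.42  # value quoted in the text; the score's scale is an assumption


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one ranking (RRF)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), then by id so ties are deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def build_response(answer: str, sources: list[dict], top_score: float) -> str:
    """Return citation JSON, or a refusal when confidence is too low or no source exists."""
    if top_score < REFUSAL_THRESHOLD or not sources:
        return json.dumps(
            {"status": "refused", "answer": None, "citations": [], "escalate": True}
        )
    return json.dumps({
        "status": "answered",
        "answer": answer,
        "citations": [{"title": s["title"], "url": s["url"]} for s in sources],
        "escalate": False,
    })


bm25_ranking = ["c7", "c2", "c9"]   # hypothetical keyword-search results
dense_ranking = ["c2", "c4", "c7"]  # hypothetical vector-search results
candidates = reciprocal_rank_fusion([bm25_ranking, dense_ranking])[:30]
print(candidates)
```

Chunks that appear in both rankings rise to the top of the fused list, and the refusal branch sets an `escalate` flag so the caller can route the question to a human instead of returning an uncited answer.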
Implementation timeline
Weeks one and two: discovery, DPIA inputs, and golden set v1. Weeks three to five: ingestion pipelines, first index, retrieval tuning on held-out questions. Weeks six to eight: Slack slash command, shadow mode for admins, then a thirty-user pilot cohort.
Weeks nine to twelve hardened operations: on-call runbook for ingestion failures, weekly eval report shared with enablement, and a kill switch tested before each model provider upgrade.
Measured results after twelve weeks
| Metric | Before | After |
|---|---|---|
| Median time to answer (sampled) | 22 min | 1.4 min |
| Volume in #ask-product Slack | baseline | −38% |
| Faithfulness on golden set | n/a | 91% human-rated |
| Sessions escalated to human | n/a | 14% |
What worked and what we would repeat
Investing in metadata and access control delivered more lift than swapping LLM vendors. When product marketing renamed a module, updating the index fixed answers — retraining would not have.
Weekly evaluation on newly published Confluence pages caught “silent stale” failures after reorganizations. That rhythm belongs to the business owner, not only engineering.
Phase two scopes fine-tuning for ticket summarization tone only — facts remain in RAG. That separation keeps legal comfortable while still shortening agent handle time.
Replication guide for your organization
Start with one department, one language, and one index. Copy the pattern: nightly sync, hybrid retrieval, citation template, and a golden set with named owners. Expand locales or divisions only when weekly eval is stable for four consecutive weeks.
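The expansion rule can be written down as a small check. This sketch assumes "stable" means the weekly golden-set score stays at or above a floor; the 0.90 floor and the function name are hypothetical and should be replaced with whatever threshold your eval owners agree on.

```python
def stable_for_expansion(weekly_scores: list[float], min_score: float = 0.90,
                         weeks: int = 4) -> bool:
    """True when the most recent `weeks` weekly scores all meet `min_score`."""
    if len(weekly_scores) < weeks:
        return False  # not enough history yet
    return all(score >= min_score for score in weekly_scores[-weeks:])


# Hypothetical weekly faithfulness scores, oldest first.
print(stable_for_expansion([0.84, 0.91, 0.92, 0.90, 0.93]))
print(stable_for_expansion([0.91, 0.92, 0.88, 0.93]))
```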
If you need help adapting connectors, residency, or evaluation design, our RAG pipeline service follows the same delivery model described in this case study.
Lessons for executives
The business sponsor attended weekly eval reviews — not only the launch demo. That kept investment tied to measurable deflection and time saved, not novelty.
Legal and security signed off because citations and refusal were non-negotiable requirements in phase one, not backlog items.
Technical debt avoided
The team resisted one-off scripts per data source. Connectors share a metadata schema and monitoring, so adding SharePoint later did not require rewriting Confluence ingestion.
Model upgrades run through the same golden set used at pilot — preventing “it worked last month” surprises.
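One way to turn that practice into a gate is to compare golden-set results before and after an upgrade. The sketch below is an assumption about how such a check could work, not the team's actual tooling; the `max_drop` tolerance and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenResult:
    question_id: str
    faithful: bool  # rating for one golden question


def faithfulness(results: list[GoldenResult]) -> float:
    """Share of golden questions rated faithful."""
    if not results:
        raise ValueError("golden set is empty")
    return sum(r.faithful for r in results) / len(results)


def upgrade_allowed(baseline: list[GoldenResult], candidate: list[GoldenResult],
                    max_drop: float = 0.02) -> bool:
    """Allow the upgrade only if faithfulness drops by no more than `max_drop`."""
    return faithfulness(candidate) >= faithfulness(baseline) - max_drop


def regressions(baseline: list[GoldenResult], candidate: list[GoldenResult]) -> list[str]:
    """Ids of questions that were faithful before the upgrade and are not after."""
    after = {r.question_id: r.faithful for r in candidate}
    return [r.question_id for r in baseline
            if r.faithful and not after.get(r.question_id, False)]


# Hypothetical ratings: 9 of 10 faithful before, 8 of 10 after.
baseline = [GoldenResult(f"q{i}", i != 3) for i in range(10)]
candidate = [GoldenResult(f"q{i}", i not in (3, 5)) for i in range(10)]
print(upgrade_allowed(baseline, candidate), regressions(baseline, candidate))
```

Listing the regressed question ids alongside the pass/fail result gives the eval owners something concrete to review, rather than a single aggregate number.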
Replication timeline for your team
Month one: golden questions and corpus cut. Month two: retrieval quality on held-out set. Month three: pilot UI and escalation. Month four: hardening and executive readout with before/after metrics.
Resist the temptation to index every Drive folder on day one: authority and freshness beat volume.
Frequently Asked Questions
- Timeline: eight weeks to a production pilot, with operational hardening through week twelve.
- Fine-tuning: not in phase one; citations and RAG met legal requirements.
- Deployment in your environment: yes, we adapt storage and inference to your VPC and compliance rules.
- Main cost driver: engineering for ingestion, eval, and guardrails, not the base LLM API line item.