An ecommerce brand handling about twelve thousand support tickets per month wanted meaningful tier-zero deflection without damaging customer satisfaction. Their previous bot sounded confident but invented return windows and restocking fees — exactly the failure mode that destroys trust in regulated retail.
We implemented RAG grounded on the public help center and signed policy PDFs, plus tool calls to Shopify for order status. Billing disputes and chargebacks always route to humans. After ninety days, deflection on eligible intents reached twenty-eight percent with CSAT within two points of the human-only baseline.
Business challenge
Peak season multiplied queue times. Agents relied on inconsistent macros, so two customers with the same issue could receive contradictory answers. Leadership required query audit logs and EU hosting before any customer-facing automation went live.
The operations team did not ask for “more AI.” They asked for fewer repeat questions about shipping status, sizing, and return eligibility — without increasing reopen rates or refund errors.
Scope and guardrails
Phase one deliberately excluded open-ended sales advice and negotiation. We indexed twelve hundred help articles and forty policy documents with version metadata and effective dates.
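The version metadata and effective dates can be pictured with a minimal sketch. The `PolicyDoc` fields and the `in_force` helper below are hypothetical names chosen for illustration, not the team's actual schema; the idea is only that retrieval should see the newest version of each document that is effective on the day of the query.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PolicyDoc:
    # Hypothetical metadata record; field names are assumptions.
    doc_id: str
    version: int
    effective_from: date
    effective_to: Optional[date] = None  # None means still in force


def in_force(docs, today):
    """Return the highest version of each document that is effective on `today`."""
    latest = {}
    for doc in docs:
        if doc.effective_from > today:
            continue  # not yet effective
        if doc.effective_to is not None and doc.effective_to < today:
            continue  # already superseded or expired
        current = latest.get(doc.doc_id)
        if current is None or doc.version > current.version:
            latest[doc.doc_id] = doc
    return sorted(latest.values(), key=lambda d: d.doc_id)
```

Filtering on effective dates before retrieval is one way to keep a superseded return window from being quoted as current policy.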
A lightweight intent router classifies billing, shipping, product, and “other” before retrieval runs. That prevents shipping macros from polluting billing answers.
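How the router classifies is not described here, so the following is only a sketch under the assumption of simple keyword matching; a production router might just as well be a small classifier. The keyword lists, `route_intent`, and `retrieval_filter` are illustrative names. The point it shows is that the intent label is decided first and then used to restrict which documents retrieval may search.

```python
import re

# Illustrative keyword lists; a real router would be tuned on historical tickets.
INTENT_KEYWORDS = {
    "billing": ("refund", "charge", "chargeback", "invoice", "payment"),
    "shipping": ("shipping", "tracking", "delivery", "delivered", "package"),
    "product": ("size", "sizing", "fit", "material", "color"),
}


def route_intent(message: str) -> str:
    """Pick the intent with the most keyword hits; fall back to "other"."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    scores = {
        intent: len(words.intersection(keywords))
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    # No hits, or an ambiguous tie, is treated as "other" rather than guessed.
    if best_score == 0 or best_score == runner_up_score:
        return "other"
    return best


def retrieval_filter(intent: str) -> dict:
    """Metadata filter passed to retrieval so only same-intent documents are searched."""
    return {"intent": intent}
```

For example, `route_intent("Where is my package? The tracking link is broken")` yields `"shipping"`, and the resulting filter keeps billing documents out of the candidate set.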
Shopify Admin API tools fetch order status and tracking, so customer PII never enters the prompt as raw text. The Zendesk sidebar shows suggested replies with mandatory source links, which agents can send as-is or edit.
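One way to keep PII out of the prompt is an allowlist applied to the tool result before anything reaches the model. The sketch below assumes the order arrives as a plain dictionary; the field names are illustrative and should be checked against the actual Shopify Admin API payload, and `order_summary_for_prompt` is a hypothetical helper.

```python
# Assumed field names for illustration; verify against the real API response.
ALLOWED_ORDER_FIELDS = ("name", "fulfillment_status")
ALLOWED_TRACKING_FIELDS = ("tracking_company", "tracking_number", "tracking_url")


def order_summary_for_prompt(order: dict) -> dict:
    """Copy only allowlisted fields; email, addresses, and the rest are dropped."""
    summary = {key: order.get(key) for key in ALLOWED_ORDER_FIELDS}
    summary["tracking"] = [
        {key: fulfillment.get(key) for key in ALLOWED_TRACKING_FIELDS}
        for fulfillment in order.get("fulfillments", [])
    ]
    return summary
```

An allowlist fails closed: a new field added upstream stays out of the prompt until someone deliberately permits it, which a blocklist would not guarantee.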
Rollout plan
- Weeks 1–2: index build and internal QA on historical tickets.
- Week 3: shadow mode on five percent of traffic (bot suggests, human sends).
- Week 4: twenty percent of traffic with live monitoring and a daily QA sample.
- Week 6: all eligible intents, with a tested kill switch.
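The staged percentages and the kill switch in the plan above can be sketched as a small gating function. This assumes tickets are assigned to the rollout by hashing a ticket ID, which is a common approach but not something the plan specifies; `rollout_bucket` and `bot_mode` are hypothetical names.

```python
import hashlib


def rollout_bucket(ticket_id: str) -> int:
    """Map a ticket ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(ticket_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def bot_mode(ticket_id: str, percent: int, shadow: bool, kill_switch: bool) -> str:
    """Return "human_only", "shadow" (bot suggests, human sends), or "live"."""
    if kill_switch:
        return "human_only"  # the kill switch overrides every other setting
    if rollout_bucket(ticket_id) >= percent:
        return "human_only"  # ticket falls outside the current rollout slice
    return "shadow" if shadow else "live"
```

Because the bucket depends only on the ticket ID, a ticket included at five percent stays included at twenty percent, which keeps week-over-week QA samples comparable.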
Outcomes and economics
Twenty-eight percent deflection on eligible intents translated to roughly thirty-four hundred tickets avoided per month. At an internal blended cost near twelve dollars per ticket, that is about forty thousand dollars of monthly capacity value, which exceeds the LLM API bill.
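The arithmetic behind that figure is easy to reproduce. The sketch below applies the deflection rate to the full monthly volume of about twelve thousand tickets, which is an assumption on our part since the eligible-intent volume is not stated separately; `monthly_capacity_value` is an illustrative helper, and integers are used to avoid floating-point rounding.

```python
def monthly_capacity_value(tickets: int, deflection_pct: int, cost_per_ticket_usd: int):
    """Return (tickets deflected, capacity value in USD) for one month."""
    deflected = tickets * deflection_pct // 100
    return deflected, deflected * cost_per_ticket_usd


# Figures from the case study; applying 28% to all 12,000 tickets is an assumption.
deflected, value_usd = monthly_capacity_value(12_000, 28, 12)
print(deflected, value_usd)  # 3360 40320
```

The result, 3,360 tickets and 40,320 dollars, is consistent with the rounded "roughly thirty-four hundred" quoted above.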
Average agent handle time dropped eleven percent because agents stopped searching three tabs; they started from a cited draft. CSAT on bot-handled threads stayed within two points of the pre-automation baseline.
Grounding and refusal rules beat a bigger model. GPT-4 with weak retrieval lost to a smaller model with hybrid search and explicit “I cannot answer” behavior.
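To make "hybrid search plus explicit refusal" concrete, here is a minimal sketch. It assumes reciprocal rank fusion (RRF) as the way to combine keyword and vector rankings, which is one common choice and not necessarily what this deployment used; the threshold value, the refusal wording, and the function names are likewise illustrative.

```python
REFUSAL = "I cannot answer that from the published policies. Routing you to an agent."


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def ground_or_refuse(keyword_hits, vector_hits, min_score=0.03):
    """Return up to three citable sources, or an explicit refusal if evidence is weak."""
    fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
    sources = [doc_id for doc_id, score in fused if score >= min_score][:3]
    if not sources:
        return {"refused": True, "message": REFUSAL, "sources": []}
    return {"refused": False, "message": None, "sources": sources}
```

With `k=60`, a document needs to rank near the top of both lists to clear the illustrative 0.03 threshold, so a query where the two retrievers disagree produces a refusal instead of a confident guess.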
Operations after launch
Content operations owns index freshness: forty-eight hours for critical policy changes, five business days for minor articles. Engineering owns eval regression before each provider model upgrade.
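The two freshness targets translate into a simple deadline check. The sketch below assumes business days are Monday through Friday with no holiday calendar, and that each change record carries a timestamp and a critical flag; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by whole business days (Mon-Fri); holidays are ignored in this sketch."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def reindex_deadline(changed_at: datetime, critical: bool) -> datetime:
    """48 hours for critical policy changes, five business days for minor articles."""
    if critical:
        return changed_at + timedelta(hours=48)
    return add_business_days(changed_at, 5)


def is_stale(changed_at, critical, indexed_at, now) -> bool:
    """True when the deadline has passed and the index has not picked up the change."""
    indexed = indexed_at is not None and indexed_at >= changed_at
    return (not indexed) and now > reindex_deadline(changed_at, critical)
```

A check like this can run on a schedule and alert content operations before a stale policy is cited to a customer.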
Phase two adds a fine-tuned model for the summarization tone of agent drafts, while RAG remains the source of policy truth.
Lessons for executives
The business sponsor attended weekly eval reviews — not only the launch demo. That kept investment tied to measurable deflection and time saved, not novelty.
Legal and security signed off because citations and refusal were non-negotiable requirements in phase one, not backlog items.
Technical debt avoided
The team resisted one-off scripts per data source. Connectors share a metadata schema and monitoring, so adding SharePoint later did not require rewriting the Confluence ingestion.
Model upgrades run through the same golden set used at pilot — preventing “it worked last month” surprises.
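A golden-set gate can be as small as the sketch below. It assumes each golden case pairs a question with the source that a correct answer must cite, and that an answering function returns the list of cited source IDs; the case format, `pass_rate`, and `upgrade_allowed` are illustrative, not the team's actual eval harness.

```python
def pass_rate(golden_set, answer_fn) -> float:
    """Fraction of golden questions whose answer cites the expected source."""
    if not golden_set:
        raise ValueError("golden set must not be empty")
    passed = sum(
        1 for case in golden_set
        if case["expected_source"] in answer_fn(case["question"])
    )
    return passed / len(golden_set)


def upgrade_allowed(golden_set, baseline_fn, candidate_fn, tolerance=0.0) -> bool:
    """Allow a model upgrade only if the candidate does not regress past `tolerance`."""
    baseline = pass_rate(golden_set, baseline_fn)
    candidate = pass_rate(golden_set, candidate_fn)
    return candidate >= baseline - tolerance
```

Running the candidate and the current model against the same fixed set makes a regression visible before customers see it, which is the property the pilot golden set is meant to preserve.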
Replication timeline for your team
Month one: golden questions and corpus cut. Month two: retrieval quality on held-out set. Month three: pilot UI and escalation. Month four: hardening and executive readout with before/after metrics.
Resist the temptation to index every drive folder on day one: authority and freshness beat volume.
Frequently Asked Questions
- Which platforms does this run on? Zendesk and Shopify; the pattern transfers to Intercom or Freshdesk.
- How are invented answers prevented? Faithfulness eval, retrieval thresholds, and mandatory escalation paths.
- Was fine-tuning used? Not in phase one; planned for tone in phase two only.