RAG Systems · 12 min read

Building a RAG System That Survives Production

A practical guide to retrieval architecture, evaluation, and reliability for production RAG systems.

Start with a business-grade target

A production RAG system is not a demo that answers a few prompts. It is a dependable workflow that supports decisions, customer support, or internal operations. Start by defining the business target in plain language. What is the task, who owns it, and what does success look like?

Good targets are measurable. Example targets include reducing support resolution time by 20 percent, improving first response accuracy to 85 percent, or enabling sales teams to retrieve verified answers in under three seconds.

Data quality is the real model

Most RAG failures come from the data layer. Start with a data inventory. Identify source systems, owners, update cadence, and access permissions. Create a single source of truth for each domain. If the knowledge base is fragmented, retrieval will surface conflicting versions of the same answer.

Normalize content before indexing. Remove duplicate documents, fix broken headers, and enforce consistent metadata. The goal is to make retrieval a precision task, not a guessing game.

  • Define document owners and update cadence
  • Remove duplicates and stale content
  • Standardize metadata such as topic, product, and region
  • Track PII and sensitive fields upfront
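The normalization steps above can be sketched in a few lines. This is a minimal, illustrative pipeline: the document schema, the metadata fields, and the function names are assumptions, not a real ingestion API. It collapses whitespace, lowercases metadata, and drops exact duplicates via a content hash.

```python
import hashlib

def normalize_doc(doc):
    """Normalize one raw document dict before indexing (illustrative schema)."""
    text = " ".join(doc["text"].split())  # collapse runs of whitespace
    return {
        "text": text,
        # enforce consistent, lowercase metadata values with explicit defaults
        "metadata": {
            "topic": doc.get("topic", "unknown").lower(),
            "product": doc.get("product", "unknown").lower(),
            "region": doc.get("region", "unknown").lower(),
        },
        # content hash lets us drop exact duplicates cheaply
        "content_hash": hashlib.sha256(text.encode()).hexdigest(),
    }

def dedupe(docs):
    """Keep the first occurrence of each distinct content hash."""
    seen, unique = set(), []
    for doc in map(normalize_doc, docs):
        if doc["content_hash"] not in seen:
            seen.add(doc["content_hash"])
            unique.append(doc)
    return unique
```

Exact-hash deduplication only catches byte-identical content after whitespace normalization; near-duplicates need fuzzier techniques such as MinHash, which you can add once the easy wins are banked.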

Chunking and indexing choices matter

Chunk size is a control knob for recall and precision. Small chunks improve recall but can dilute context. Large chunks preserve context but can bury the answer. Start with 300 to 700 tokens and tune with evaluation results.
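A fixed-size chunker with overlap is a reasonable starting point for the 300 to 700 token range above. This sketch assumes tokens are already a list (from whatever tokenizer you use); the function name and defaults are illustrative, not a standard API.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into overlapping chunks of `size` tokens.

    Overlap keeps a sentence that straddles a boundary retrievable
    from both neighboring chunks. Assumes size > overlap.
    """
    chunks = []
    step = size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk already covers the tail
    return chunks
```

Treat `size` and `overlap` as tuning knobs driven by your evaluation set, not constants; structure-aware splitting (on headings or paragraphs) usually beats fixed windows once you have metrics to prove it.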

Use metadata filters for precision. A contract clause should not compete with a blog post. Metadata filters keep retrieval relevant and reduce hallucinations.

Retrieval should be hybrid by default

Dense retrieval is powerful, but lexical retrieval is still essential for exact matches, codes, and specific terms. Hybrid retrieval gives you the best of both. Add a lightweight re-ranking step for better ordering.

  • Hybrid retrieval with dense and lexical signals
  • Re-rank the top 20 down to the top 5 for quality
  • Use query rewriting for ambiguous user input

Build an evaluation harness early

Evaluation is the difference between confidence and hope. Build a small test set of real questions with verified answers. Track accuracy, context precision, and failure modes. Update the set every time the system changes.

Score both the retrieved context and the final answer. If the context is wrong, the answer is likely wrong even if it sounds fluent.
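A harness that scores context and answer separately can be very small. This sketch assumes a test set of dicts with a gold source document id and an expected answer substring, plus `retrieve` and `answer` callables standing in for your pipeline; all of those names and the schema are illustrative.

```python
def evaluate(test_set, retrieve, answer):
    """Score retrieval and generation separately on a small gold set.

    Each case needs: question, expected_doc (gold source id), and
    expected (a substring the correct answer must contain).
    """
    hits, correct = 0, 0
    for case in test_set:
        ctx_ids = retrieve(case["question"])  # ranked doc ids
        if case["expected_doc"] in ctx_ids:
            hits += 1  # retrieval found the gold document
        reply = answer(case["question"], ctx_ids)
        if case["expected"].lower() in reply.lower():
            correct += 1  # crude but cheap answer check
    n = len(test_set)
    return {"context_recall": hits / n, "answer_accuracy": correct / n}
```

Substring matching is deliberately crude; it catches regressions cheaply, and you can layer LLM-as-judge scoring on top once the basics are wired in. The key property is that a retrieval failure shows up in `context_recall` even when the answer happens to sound plausible.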

Observability and feedback loops

Once the system is live, log queries, retrieved documents, model output, and user feedback. Review weekly. Create a data quality backlog and treat it like product work.

Include a feedback mechanism in the UI. A single thumbs down can reveal a missing document or broken ingestion pipeline.
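The logging described above works best as one structured record per interaction, so a thumbs-down can be joined back to the exact documents that were retrieved. A minimal sketch, with an assumed record shape; in production this would feed a log pipeline rather than return a string.

```python
import json
import time

def log_interaction(query, doc_ids, response, feedback=None):
    """Build one structured log record per RAG interaction.

    Capturing query, retrieved docs, output, and feedback together
    lets weekly reviews trace a bad answer back to its sources.
    """
    record = {
        "ts": time.time(),
        "query": query,
        "retrieved_doc_ids": doc_ids,
        "response": response,
        "feedback": feedback,  # e.g. "thumbs_up", "thumbs_down", or None
    }
    return json.dumps(record)
```

Scrub or redact PII from logged queries before anything leaves the trust boundary; the record shape matters less than the join keys.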

Security and compliance are not optional

RAG systems often touch sensitive data. Apply access control at retrieval time, not just at the UI. Enforce tenant isolation. Audit data exports. Use encryption in transit and at rest.
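Enforcing access control at retrieval time means filtering results after the index query and before context assembly. This is a sketch under an assumed document schema (`tenant_id`, `acl_groups`); your index may support pushing these filters into the query itself, which is preferable when available.

```python
def filter_by_access(results, user_groups, tenant_id):
    """Drop retrieved docs the user may not see.

    Runs after vector/lexical search, before the context window is
    assembled, so unauthorized text never reaches the model.
    """
    allowed = []
    for doc in results:
        if doc["tenant_id"] != tenant_id:
            continue  # hard tenant isolation
        if doc["acl_groups"] and not set(doc["acl_groups"]) & set(user_groups):
            continue  # doc is group-restricted and user lacks every group
        allowed.append(doc)  # empty acl_groups = public within the tenant
    return allowed
```

Filtering post-retrieval is the safety net; pre-filtering in the index is the performance win, since restricted documents then never consume result slots.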

Latency and cost budgeting

Latency budgets should be explicit. Aim for retrieval under 500ms, re-ranking under 200ms, and model response under two seconds for most workflows. Cache popular queries and pre-compute embeddings for high-value collections.
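Query caching can start as small as a TTL dictionary keyed on the normalized query text. This sketch is illustrative; the class name and defaults are assumptions, and a shared cache like Redis replaces it once you run more than one instance.

```python
import time

class QueryCache:
    """Tiny in-process TTL cache for popular query results."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # normalized query -> (result, timestamp)

    @staticmethod
    def _key(query):
        # normalize case and whitespace so trivial variants share an entry
        return " ".join(query.lower().split())

    def get(self, query):
        entry = self.store.get(self._key(query))
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]
        return None  # miss or expired

    def put(self, query, result):
        self.store[self._key(query)] = (result, time.time())
```

Keep the TTL short for content that updates often; a stale cached answer is a correctness bug, not just a freshness annoyance.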

Go live checklist

  • Clear success metrics and owners
  • Data inventory and access control in place
  • Hybrid retrieval with tuned chunking
  • Evaluation harness and weekly review cadence
  • Monitoring for latency and accuracy
  • Rollback plan and error handling