Building RAG Systems That Actually Work: Beyond the Hello World Tutorial

Every week a founder shows us a LangChain notebook that answers questions about their PDF pitch deck perfectly, then fails on the 40,000-page policy corpus in production. Vextrosys has deployed RAG for legal research, internal IT support, and clinical guideline assistants. The gap between demo and production is retrieval quality, evaluation, and operational guardrails, not prompt engineering alone.

Retrieval quality dominates generation quality

If the right chunks are not in the top-k context window, no system prompt saves you. We spend 60-70% of project effort on ingestion, chunking, indexing, and evaluation before tuning the LLM. Generation is the easy part once recall@10 exceeds 0.85 on your golden question set.
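As a reference point for the recall@10 figure above, here is a minimal sketch of one common definition of recall@k: the fraction of a question's labeled relevant chunks that appear in the top k results, averaged over the golden set. The function names and the `(retrieved_ids, relevant_ids)` input shape are illustrative assumptions, not a fixed interface.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the labeled relevant chunks found in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each golden question needs at least one relevant chunk")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs: Sequence[tuple], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)
```

Some teams report hit rate instead (whether at least one relevant chunk made the top k); whichever definition you pick, keep it fixed across eval runs so the numbers stay comparable.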

Metric we ship with

We measure answer faithfulness and context recall with RAGAS or custom LLM-as-judge pipelines on 200+ labeled Q&A pairs per domain. Nothing launches to production without a regression suite.

Chunking: structure beats fixed token windows

Legal/regulatory: chunk by section/clause hierarchy, preserve citation metadata
Technical docs: split on H2/H3, keep code blocks intact, attach breadcrumb titles
Tickets/CRM: one chunk per resolution thread with timestamps and product version
Tables: markdown serialization or HTML row groups - never split mid-row without headers

# Hierarchical chunk metadata example
{
  "doc_id": "policy-2024-claims",
  "section_path": ["Chapter 4", "4.2 Prior Authorization"],
  "chunk_index": 12,
  "effective_date": "2024-01-01"
}
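To make the technical-docs rule concrete, the following is a minimal sketch of heading-based chunking for Markdown input. It splits on H2/H3, never splits inside a fenced code block, and attaches the breadcrumb as `section_path`. The `Chunk` class and `chunk_by_headings` function are hypothetical names for illustration; a production pipeline would also carry `doc_id`, `chunk_index`, and date metadata like the example above.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{2,3})\s+(.*)$")  # H2 and H3 only
FENCE = "`" * 3  # Markdown code-fence marker


@dataclass
class Chunk:
    section_path: list  # breadcrumb, e.g. ["Install", "Linux"]
    text: str


def chunk_by_headings(markdown: str) -> list:
    chunks, path, buf = [], [], []
    in_code = False

    def flush():
        body = "\n".join(buf).strip()
        if body:
            chunks.append(Chunk(list(path), body))
        buf.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code = not in_code  # entering or leaving a code block
        match = None if in_code else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))  # 2 for H2, 3 for H3
            path[:] = path[: level - 2] + [match.group(2).strip()]
        else:
            buf.append(line)  # body text, including whole code blocks
    flush()
    return chunks
```

Lines that look like headings inside a code fence (shell comments, for example) stay with the surrounding chunk, which is the point of tracking `in_code`.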

Chunk size tradeoffs

A 512-token chunk with 64-token overlap is a mediocre default. Dense legal text often needs 256 tokens; conversational support docs work at 768. We run an offline grid search over chunk size against recall@5 on labeled queries before locking the pipeline config.
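A minimal sketch of that grid search, under stated assumptions: `window_chunks` produces fixed windows with overlap over an already-tokenized document, and `score_fn` is a placeholder for whatever computes recall@5 after indexing those chunks and running the labeled queries. Both names are hypothetical.

```python
from typing import Callable, Sequence


def window_chunks(tokens: Sequence[str], size: int, overlap: int) -> list:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def grid_search(tokens: Sequence[str], configs: Sequence[tuple], score_fn: Callable) -> tuple:
    """Score every (size, overlap) pair and return the best one plus all scores."""
    results = {cfg: score_fn(window_chunks(tokens, *cfg)) for cfg in configs}
    best = max(results, key=results.get)
    return best, results
```

In practice `configs` might be something like `[(256, 32), (512, 64), (768, 96)]`, with the scoring step being the expensive part because each configuration requires re-embedding and re-indexing.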

Embeddings and hybrid search

We default to text-embedding-3-large or Cohere embed-v3 for English enterprise content; multilingual projects use voyage-3 or BGE-M3. Pure vector search fails on SKU codes, case numbers, and exact policy IDs, so hybrid BM25 + dense retrieval with reciprocal rank fusion (RRF) is standard in our stacks.

Ingest → normalize → chunk → embed → index (OpenSearch k-NN or pgvector)
Query → rewrite (optional HyDE or multi-query) → hybrid retrieve top 50
Re-rank to top 8 with cross-encoder (Cohere rerank-v3 or bge-reranker-large)
Assemble context with token budget + citation markers → LLM generate
Post-check: citation coverage, refusal if confidence low
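The context-assembly step in the pipeline above can be sketched as follows. This is an illustration under assumptions: chunks arrive already re-ranked as `(chunk_id, text)` pairs, and token counting defaults to a whitespace split, which a real system would replace with the target model's tokenizer. `assemble_context` is a hypothetical name.

```python
from typing import Callable, Sequence


def assemble_context(
    chunks: Sequence[tuple],
    token_budget: int,
    count_tokens: Callable = lambda s: len(s.split()),  # crude stand-in tokenizer
) -> tuple:
    """Pack ranked chunks into the prompt until the token budget is spent.

    Returns the context string and a map from citation number to chunk ID.
    """
    parts, citations, used = [], {}, 0
    for chunk_id, text in chunks:
        entry = f"[{len(parts) + 1}] {text}"
        cost = count_tokens(entry)
        if used + cost > token_budget:
            break  # stop at the first chunk that does not fit, preserving rank order
        parts.append(entry)
        citations[len(parts)] = chunk_id
        used += cost
    return "\n\n".join(parts), citations
```

The returned citation map is what lets a UI resolve a `[2]` in the answer back to a specific chunk.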

// RRF score fusion (simplified)
score(d) = sum over rank_lists: 1 / (k + rank_i(d))  // k=60 typical
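The formula above translates into a few lines of Python. This sketch assumes each input list holds document IDs ordered best first, uses 1-based ranks, and lets a document that is missing from a list contribute nothing for that list. The name `rrf_fuse` is illustrative.

```python
from collections import defaultdict
from typing import Sequence


def rrf_fuse(rank_lists: Sequence[Sequence[str]], k: int = 60) -> list:
    """Fuse ranked ID lists with reciprocal rank fusion.

    Returns (doc_id, score) pairs sorted by descending fused score,
    with ties broken by doc_id for deterministic output.
    """
    scores = defaultdict(float)
    for ranking in rank_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Because RRF only looks at ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.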

Query understanding and routing

Not every user question should hit the same index. We classify intent: factual lookup vs. summarization vs. procedural how-to. Summarization routes to map-reduce over section summaries precomputed nightly. Factual lookups use tight top-k and mandatory citations. Out-of-domain queries hit a refusal path trained on negative examples.

HyDE helps on abstract questions but hallucinates retrieval queries on numeric lookups. We enable it per intent class, not globally.
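One way to express per-intent settings is a small routing table. The sketch below is illustrative only: the index names, `top_k` values, and which classes enable HyDE are assumptions made for the example, and the digit check is a crude stand-in for detecting numeric lookups. Intent classification itself is assumed to happen upstream.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Intent(Enum):
    FACTUAL = "factual"
    SUMMARIZATION = "summarization"
    PROCEDURAL = "procedural"
    OUT_OF_DOMAIN = "out_of_domain"


@dataclass(frozen=True)
class Route:
    index: Optional[str]  # None means skip retrieval and refuse
    top_k: int
    use_hyde: bool
    require_citations: bool


# Hypothetical values; tune per deployment.
ROUTES = {
    Intent.FACTUAL: Route("chunks", 5, False, True),
    Intent.SUMMARIZATION: Route("section_summaries", 20, True, False),
    Intent.PROCEDURAL: Route("chunks", 8, True, True),
    Intent.OUT_OF_DOMAIN: Route(None, 0, False, False),
}


def plan(intent: Intent, query: str) -> Route:
    """Pick retrieval settings for a classified query."""
    route = ROUTES[intent]
    # Assumed heuristic: digits suggest an ID or numeric lookup, so skip HyDE.
    if route.use_hyde and any(ch.isdigit() for ch in query):
        route = replace(route, use_hyde=False)
    return route
```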

Grounding, citations, and hallucination control

Force inline [1][2] citations mapped to chunk IDs shown in UI
Temperature 0-0.2 for factual modes; higher only for drafting assistants
Self-check pass: second LLM call verifies each claim appears in provided context
Block answers when max rerank score below threshold - prefer 'I don't know'
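Two of the controls above, the rerank threshold and citation mapping, can be combined into a single post-generation guard. This is a sketch under assumptions: `citations` maps citation numbers to chunk IDs for the chunks actually placed in the prompt, and the policy of refusing when an answer has no citations or cites an unknown number is one possible choice rather than a rule from the list above. `guard_answer` is a hypothetical name, and the threshold must be calibrated per reranker.

```python
import re
from typing import Mapping, Sequence

CITATION = re.compile(r"\[(\d+)\]")
REFUSAL = "I don't know"


def guard_answer(
    answer: str,
    citations: Mapping[int, str],
    rerank_scores: Sequence[float],
    min_score: float,
) -> str:
    """Return the answer only if retrieval was confident and citations resolve."""
    if not rerank_scores or max(rerank_scores) < min_score:
        return REFUSAL  # nothing retrieved was relevant enough
    cited = {int(n) for n in CITATION.findall(answer)}
    if not cited or not cited <= set(citations):
        return REFUSAL  # uncited answer, or a citation with no backing chunk
    return answer
```

This check is cheap and deterministic, so it can run before the more expensive second-LLM self-check.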

Healthcare deployment note

PHI never goes to an embedding API unless the endpoint is covered by a BAA. For one client we run local embedding models on GPU instances and accept the latency tradeoff for compliance.

Evaluation and continuous improvement

Golden datasets are living documents. Analysts flag bad answers in the production UI, and those labels flow into weekly eval runs. We track retrieval recall, answer correctness, and latency percentiles per cohort. Prompt and index versions are pinned in MLflow; rollbacks are one click.
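A regression gate over those weekly runs can be as simple as comparing each metric to a pinned baseline. The sketch below assumes every metric passed in is higher-is-better (latency percentiles would need the comparison flipped) and treats a missing metric as a failure. `find_regressions` and the default tolerance are illustrative.

```python
from typing import Mapping


def find_regressions(
    baseline: Mapping[str, float],
    current: Mapping[str, float],
    tolerance: float = 0.02,
) -> dict:
    """Return metrics that dropped by more than `tolerance` or went missing.

    Assumes higher is better for every metric in `baseline`.
    """
    failed = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is None or base - now > tolerance:
            failed[name] = (base, now)
    return failed
```

A CI job can fail the build whenever the returned dict is non-empty, which blocks a prompt or index change from shipping with a silent quality drop.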

Cost and latency

Re-ranking 50 chunks adds 80-120ms but cuts hallucinations measurably. Cache embedding lookups for frequent queries. Use smaller models (GPT-4o-mini, Claude Haiku) for routing and extraction; reserve frontier models for final synthesis on complex threads only.
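Caching query embeddings needs little more than a normalized key and an in-process LRU cache. In this sketch `embed_remote` is a stand-in that fakes a vector from a hash so the example runs offline; a real deployment would call its embedding provider there, and a shared cache such as Redis would replace `lru_cache` across multiple workers. All names are hypothetical.

```python
import hashlib
from functools import lru_cache

CALLS = {"count": 0}  # instrumentation for the example only


def embed_remote(text: str) -> tuple:
    """Stand-in for a real embedding API call; returns a fake 8-dim vector."""
    CALLS["count"] += 1
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return tuple(byte / 255.0 for byte in digest[:8])


def normalize(query: str) -> str:
    """Collapse whitespace and case so trivially different queries share a key."""
    return " ".join(query.lower().split())


@lru_cache(maxsize=10_000)
def _embed_cached(normalized_query: str) -> tuple:
    return embed_remote(normalized_query)


def embed_query(query: str) -> tuple:
    return _embed_cached(normalize(query))
```

Lowercasing is a judgment call: it raises the hit rate, but it can hurt if the embedding model treats case-sensitive identifiers differently, so it may be worth disabling for ID-heavy corpora.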

Build eval set before building UI - stakeholders underestimate this effort
Invest in metadata filters (date, product, jurisdiction) early
Treat the index as a product: re-ingestion on source updates with versioning
Plan for PII redaction at ingest, not at query time

Production RAG is a search problem with an LLM frontend. Teams that win treat retrieval, evaluation, and governance as first-class engineering work - the notebook demo is step zero, not step nine.
