Every week a founder shows us a LangChain notebook that answers questions about their PDF pitch deck perfectly, then fails on a 40,000-page policy corpus in production. Vextrosys has deployed RAG for legal research, internal IT support, and clinical guideline assistants. The gap between demo and production lies in retrieval quality, evaluation, and operational guardrails, not in prompt engineering alone.
Retrieval quality dominates generation quality
If the right chunks are not in the top-k context window, no system prompt saves you. We spend 60-70% of project effort on ingestion, chunking, indexing, and evaluation before tuning the LLM. Generation is the easy part once recall@10 exceeds 0.85 on your golden question set.
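To make the recall@10 target concrete, here is a minimal sketch of how recall@k could be computed over a golden question set. The function names, question IDs, and chunk IDs are illustrative assumptions, not part of any particular library.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        raise ValueError("each golden question needs at least one relevant chunk")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], golden: Dict[str, Set[str]], k: int = 10
) -> float:
    """Average recall@k across every question in the golden set."""
    scores = [recall_at_k(runs[q], golden[q], k) for q in golden]
    return sum(scores) / len(scores)


# Hypothetical golden labels and retriever output, keyed by question ID.
golden = {"q1": {"c1", "c7"}, "q2": {"c3"}}
runs = {"q1": ["c1", "c2", "c7"], "q2": ["c9", "c8", "c4"]}

score = mean_recall_at_k(runs, golden, k=10)
print(f"recall@10 = {score:.2f}")
```

Here q1 finds both labeled chunks and q2 finds none, so the mean lands well below the 0.85 bar and the retriever, not the prompt, is what needs work.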
Metrics we ship with
We measure answer faithfulness and context recall with RAGAS or a custom LLM-as-judge pipeline on 200+ labeled Q&A pairs per domain. No production launch without a regression suite.
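One way to turn "no launch without a regression suite" into a gate is to compare each eval run against a pinned baseline. The sketch below assumes the metric scores have already been computed elsewhere (by RAGAS or a judge pipeline); the function name, the tolerance, and all numbers are invented for illustration.

```python
from typing import Dict, List


def regression_failures(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.02
) -> List[str]:
    """Return one message per metric that dropped more than `tolerance` below baseline."""
    failures = []
    for metric, base in baseline.items():
        value = current.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from current run")
        elif value < base - tolerance:
            failures.append(f"{metric}: {value:.2f} < baseline {base:.2f}")
    return failures


# Illustrative scores only; real values come from the eval pipeline.
baseline = {"faithfulness": 0.90, "context_recall": 0.85}
current = {"faithfulness": 0.91, "context_recall": 0.80}

failures = regression_failures(baseline, current)
if failures:
    print("Blocking release:", "; ".join(failures))
```

A CI job could fail the build whenever the returned list is non-empty.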
Chunking: structure beats fixed token windows
- Legal/regulatory: chunk by section/clause hierarchy, preserve citation metadata
- Technical docs: split on H2/H3, keep code blocks intact, attach breadcrumb titles
- Tickets/CRM: one chunk per resolution thread with timestamps and product version
- Tables: markdown serialization or HTML row groups - never split mid-row without headers
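As an illustration of the technical-docs rule above, the following sketch splits Markdown on H2/H3 headings, leaves fenced code blocks intact, and attaches a breadcrumb title to each chunk. It is a simplified assumption of how such a splitter might look: `chunk_by_headings` and the sample document are hypothetical, and a real pipeline would also need to handle other fence styles and size limits.

```python
import re
from typing import Dict, List, Optional

FENCE = "`" * 3  # Markdown code-fence marker
HEADING = re.compile(r"^(#{2,3})\s+(.*)$")  # matches H2 and H3 only


def chunk_by_headings(markdown: str, doc_title: str) -> List[Dict[str, str]]:
    """Split on H2/H3, never inside a fenced code block, and add breadcrumbs."""
    chunks: List[Dict[str, str]] = []
    path: Dict[int, Optional[str]] = {2: None, 3: None}
    buffer: List[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            crumbs = [doc_title] + [t for t in (path[2], path[3]) if t]
            chunks.append({"breadcrumb": " > ".join(crumbs), "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section under its old breadcrumb
            level = len(match.group(1))
            path[level] = match.group(2).strip()
            if level == 2:
                path[3] = None  # a new H2 resets the H3 below it
        else:
            buffer.append(line)
    flush()
    return chunks


sample = "\n".join([
    "## Install",
    "Run the installer.",
    "### Linux",
    FENCE + "bash",
    "## not a heading, just a shell comment",
    "make install",
    FENCE,
    "## Configure",
    "Edit config.yaml.",
])
chunks = chunk_by_headings(sample, "Admin Guide")
for chunk in chunks:
    print(chunk["breadcrumb"])
```

The shell comment inside the fence starts with `##` but stays inside the "Linux" chunk, which is the behavior a naive heading split gets wrong.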
# Hierarchical chunk metadata example
{
  "doc_id": "policy-2024-claims",
  "section_path": ["Chapter 4", "4.2 Prior Authorization"],
  "chunk_index": 12,
  "effective_date": "2024-01-01"
}
Chunk size tradeoffs
A 512-token chunk size with 64-token overlap is a mediocre default. Dense legal text often needs 256 tokens; conversational support docs work at 768. We run an offline grid search over chunk size against recall@5 on labeled queries before locking the pipeline config.
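For reference, a fixed-window chunker with overlap is only a few lines, which is part of why it is the default everyone starts from. This sketch covers the chunking half of such a grid search and does not run the recall evaluation. It assumes the text has already been tokenized; the `window_chunks` name and the placeholder tokens are illustrative.

```python
from typing import List


def window_chunks(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens; a real run would use the embedding model's tokenizer.
tokens = [f"t{i}" for i in range(1000)]

# Candidate sizes from the paragraph above, each with 64-token overlap.
sizes = {size: len(window_chunks(tokens, size=size, overlap=64)) for size in (256, 512, 768)}
print(sizes)
```

Each candidate size would then be indexed and scored with recall@5 on the labeled queries before one is chosen.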
Embeddings and hybrid search
We default to text-embedding-3-large or Cohere embed-v3 for English enterprise content; multilingual projects use voyage-3 or BGE-M3. Pure vector search fails on SKU codes, case numbers, and exact policy IDs, so hybrid BM25 + dense retrieval with reciprocal rank fusion (RRF) is standard in our stacks.
- Ingest → normalize → chunk → embed → index (OpenSearch k-NN or pgvector)
- Query → rewrite (optional HyDE or multi-query) → hybrid retrieve top 50
- Re-rank to top 8 with cross-encoder (Cohere rerank-v3 or bge-reranker-large)
- Assemble context with token budget + citation markers → LLM generate
- Post-check: citation coverage, refusal if confidence low
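The hybrid retrieve step in the pipeline above depends on merging the BM25 and dense rankings. A minimal Python sketch of reciprocal rank fusion follows, assuming each retriever returns an ordered list of document IDs; the `rrf_fuse` name and the sample IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rank_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several rankings: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rank_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25 = ["policy-77", "faq-3", "guide-9"]
dense = ["guide-9", "policy-77", "memo-1"]

fused = rrf_fuse([bm25, dense])
print(fused)
```

Documents that appear in both lists rise to the top even when neither retriever ranked them first, and no score normalization between BM25 and cosine similarity is needed.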
// RRF score fusion (simplified)
score(d) = sum over rank_lists: 1 / (k + rank_i(d)) // k=60 typical
Query understanding and routing
Not every user question should hit the same index. We classify intent: factual lookup vs. summarization vs. procedural how-to. Summarization routes to map-reduce over section summaries precomputed nightly. Factual lookups use tight top-k and mandatory citations. Out-of-domain queries hit a refusal path trained on negative examples.
HyDE helps on abstract questions but hallucinates retrieval queries on numeric lookups. We enable it per intent class, not globally.
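Per-intent routing can be expressed as a small lookup table. The sketch below assumes an upstream classifier has already produced the intent label; the handler names, top-k values, and which intents enable HyDE are illustrative assumptions rather than our production settings.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Intent(Enum):
    FACTUAL = "factual_lookup"
    SUMMARY = "summarization"
    PROCEDURAL = "procedural_how_to"
    OUT_OF_DOMAIN = "out_of_domain"


@dataclass(frozen=True)
class Route:
    handler: str             # hypothetical name of the downstream pipeline
    top_k: int               # chunks passed to the generator
    use_hyde: bool           # HyDE query rewriting on or off for this intent
    require_citations: bool


# Illustrative values: HyDE stays off for factual lookups, where numeric
# queries are common, and on for more abstract how-to questions.
ROUTES: Dict[Intent, Route] = {
    Intent.FACTUAL: Route("hybrid_retrieve", 5, use_hyde=False, require_citations=True),
    Intent.SUMMARY: Route("map_reduce_section_summaries", 0, use_hyde=False, require_citations=False),
    Intent.PROCEDURAL: Route("hybrid_retrieve", 8, use_hyde=True, require_citations=True),
    Intent.OUT_OF_DOMAIN: Route("refuse", 0, use_hyde=False, require_citations=False),
}


def route_for(intent: Intent) -> Route:
    """Look up the route, falling back to refusal for anything unmapped."""
    return ROUTES.get(intent, ROUTES[Intent.OUT_OF_DOMAIN])


print(route_for(Intent.FACTUAL))
```

Keeping the table in config rather than in prompt text makes it easy to version alongside the index.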
Grounding, citations, and hallucination control
- Force inline [1][2] citations mapped to chunk IDs shown in UI
- Temperature 0-0.2 for factual modes; higher only for drafting assistants
- Self-check pass: second LLM call verifies each claim appears in provided context
- Block answers when the max rerank score is below a threshold - prefer 'I don't know'
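Two of the controls above, the rerank-score gate and the citation-to-chunk mapping, are simple enough to sketch directly. This is an assumed shape, not a fixed API: the function names and the 0.35 threshold are hypothetical, and the right threshold depends on the reranker's score scale.

```python
import re
from typing import Dict, List, Tuple

REFUSAL = "I don't know."


def should_answer(rerank_scores: List[float], threshold: float) -> bool:
    """Answer only if at least one chunk clears the rerank threshold."""
    return bool(rerank_scores) and max(rerank_scores) >= threshold


def resolve_citations(answer: str, chunk_ids: List[str]) -> Tuple[Dict[int, str], List[int]]:
    """Map [n] markers to chunk IDs; also return markers that point at no chunk."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    mapping = {n: chunk_ids[n - 1] for n in cited if 1 <= n <= len(chunk_ids)}
    dangling = [n for n in cited if n not in mapping]
    return mapping, dangling


chunk_ids = ["c-12", "c-40"]  # chunks placed in the context, in order
answer = "Prior authorization is required [1][3]."

if should_answer([0.81, 0.42], threshold=0.35):  # threshold is illustrative
    mapping, dangling = resolve_citations(answer, chunk_ids)
    print(mapping, dangling)
else:
    print(REFUSAL)
```

A dangling marker such as [3] here means the model cited a chunk it was never given, which is a reasonable trigger for the self-check pass or a refusal.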
Healthcare deployment note
PHI is sent to embedding APIs only through BAA-covered endpoints. For one client we run local embedding models on GPU instances instead; the latency tradeoff was accepted for compliance.
Evaluation and continuous improvement
Golden datasets are living documents. Analysts flag bad answers in production UI; labels flow to weekly eval runs. We track retrieval recall, answer correctness, and latency percentiles per cohort. Prompt and index versions are pinned in MLflow; rollbacks are one click.
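Latency percentiles per cohort need nothing beyond the standard library. The sketch below uses the nearest-rank definition of a percentile; the cohort names and latency values are invented for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_by_cohort(events: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Group (cohort, latency_ms) events and report p50 and p95 per cohort."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for cohort, latency_ms in events:
        grouped[cohort].append(latency_ms)
    return {
        cohort: {"p50": percentile(vals, 50), "p95": percentile(vals, 95)}
        for cohort, vals in grouped.items()
    }


# Hypothetical request log entries.
events = [
    ("legal", 900.0), ("legal", 1200.0), ("legal", 2500.0),
    ("it_support", 400.0), ("it_support", 650.0),
]
report = latency_by_cohort(events)
print(report)
```

Splitting by cohort matters because a healthy global p95 can hide one domain whose long documents make every query slow.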
Cost and latency
Re-ranking 50 chunks adds 80-120ms but cuts hallucinations measurably. Cache embedding lookups for frequent queries. Use smaller models (GPT-4o-mini, Claude Haiku) for routing and extraction; reserve frontier models for final synthesis on complex threads only.
- Build eval set before building UI - stakeholders underestimate this effort
- Invest in metadata filters (date, product, jurisdiction) early
- Treat the index as a product: re-ingestion on source updates with versioning
- Plan for PII redaction at ingest, not at query time
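To illustrate the last point, redaction at ingest means the index never stores the raw identifiers in the first place. The sketch below is deliberately minimal and assumes only two pattern types; the regexes and labels are illustrative, and real deployments typically need NER-based detection and domain-specific rules on top.

```python
import re
from typing import Dict, Pattern

# Illustrative patterns only; not an exhaustive PII definition.
PATTERNS: Dict[str, Pattern[str]] = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each matched span with a bracketed label before chunking and embedding."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


raw = "Contact jane.doe@example.com, SSN 123-45-6789, about claim 4.2."
clean = redact(raw)
print(clean)
```

Running this before chunking keeps the redacted form consistent across the chunk text, the embeddings, and any cached context.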
Production RAG is a search problem with an LLM frontend. Teams that win treat retrieval, evaluation, and governance as first-class engineering work - the notebook demo is step zero, not step nine.