RAG Explained: Turn AI Models Into Grounded Experts

Job openings have been added at the end of the article!

Welcome to Grind Engineer, your guide to becoming a better software engineer! No fluff. Pure engineering insights.

ChatGPT knows nothing about your company's internal docs. Claude has never read your codebase. Gemini cannot search your private Notion workspace.

LLMs are trained mostly on public internet data up to a cutoff date. After that, their weights are frozen. On their own they cannot access your data or learn new facts, and when they do not know something, they often make it up confidently. This is called hallucination.

RAG fixes this. Retrieval Augmented Generation is the architecture pattern that connects an LLM to YOUR data so it can answer questions about things it was never trained on, with far less hallucination.

The Open Book Exam Analogy

Think of an LLM without RAG as a closed book exam. The student (the model) must answer everything from memory. If they do not remember, they guess. Sometimes the guess sounds convincing but is completely wrong.

RAG turns it into an open book exam. Before answering, the student looks up the relevant pages in the textbook, reads them, and then writes an answer based on what they just read. The student still needs intelligence to synthesize and reason, but they no longer need to memorize everything.

The "textbook" is your data: internal docs, PDFs, knowledge bases, codebases, Confluence pages, support tickets, product catalogs, anything.

How RAG Works: Two Phases

RAG has two distinct phases. One runs offline (once up front, then again whenever your data changes). The other runs online (on every query).

Phase 1: Ingestion (Offline, Runs Once)

This phase prepares your data for fast retrieval.

Step 1: Load documents. Collect your raw data: PDFs, Word files, web pages, database records, markdown files, Slack messages. Anything the LLM should know about.

Step 2: Chunk the documents. Split each document into smaller pieces, typically 300 to 600 tokens each with a small overlap between chunks. Why? Because LLMs have context window limits and vector search works better on focused passages than entire documents.
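Here is a minimal sketch of chunking with overlap. It assumes whitespace-separated words as a rough stand-in for tokens; a real pipeline would count tokens with the embedding model's tokenizer. The function name `chunk_words` and its parameters are hypothetical, chosen only for illustration.

```python
def chunk_words(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Each chunk repeats the last `overlap` words of the previous chunk,
    so a sentence cut at a boundary still appears whole in one of them.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Tiny demo: 10 "words", chunks of 4 with 1 word of overlap
sample = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(sample, chunk_size=4, overlap=1)
```

In the demo, each chunk starts with the last word of the previous one, which is the overlap doing its job.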

Step 3: Create embeddings. Pass each chunk through an embedding model (like OpenAI's text-embedding-3-small or open source alternatives like bge-large). The model converts each text chunk into a numerical vector, a list of numbers that represents the meaning of that text. Similar meanings produce similar vectors.
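To make "similar meanings produce similar vectors" concrete, here is a small sketch of cosine similarity, one common way to compare embeddings. The three-dimensional vectors are made up for illustration; real embedding models return hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as unrelated
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings
auth_query = [0.8, 0.2, 0.1]     # "how do we handle authentication?"
login_chunk = [0.9, 0.1, 0.0]    # chunk about the user login flow
pricing_chunk = [0.0, 0.2, 0.9]  # chunk about pricing

sim_related = cosine_similarity(auth_query, login_chunk)
sim_unrelated = cosine_similarity(auth_query, pricing_chunk)
```

With these toy numbers, the login chunk scores far higher against the authentication query than the pricing chunk does.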

Step 4: Store in a vector database. Save all the vectors and their original text chunks into a vector database (Pinecone, Weaviate, Qdrant, Chroma, or PostgreSQL with pgvector). This database is optimized for one operation: finding the vectors most similar to a given query vector.
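The core operation of a vector database can be sketched in a few lines. This is a hypothetical in-memory store (`TinyVectorStore` is not a real library) that compares the query against every stored vector. Production databases typically use approximate nearest neighbor indexes instead of this brute-force scan so they stay fast at millions of vectors.

```python
import math
from dataclasses import dataclass, field


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class TinyVectorStore:
    """Hypothetical in-memory store holding (vector, text) pairs."""
    entries: list[tuple[list[float], str]] = field(default_factory=list)

    def add(self, vector: list[float], text: str) -> None:
        self.entries.append((vector, text))

    def search(self, query: list[float], k: int = 3) -> list[tuple[float, str]]:
        # Score every stored vector against the query, then keep the k best
        scored = [(cosine(query, vec), text) for vec, text in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Made-up vectors standing in for real embeddings
store = TinyVectorStore()
store.add([0.9, 0.1, 0.0], "User login flow with JWT tokens")
store.add([0.0, 0.2, 0.9], "Pricing tiers and billing")
store.add([0.7, 0.6, 0.1], "Password reset emails")

top_two = store.search([0.8, 0.2, 0.1], k=2)
```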

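# Phase 1: Ingestion (simplified)
# Import paths assume a recent LangChain release with the split packages:
# langchain-community, langchain-text-splitters, langchain-openai, langchain-chroma
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Step 1: Load documents (TextLoader reads each markdown file as plain text)
loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=TextLoader)
docs = loader.load()

# Step 2: Chunk (sizes here are counted in characters by default, not tokens)
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Step 3 and 4: Embed and store (expects OPENAI_API_KEY in the environment)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")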

Phase 2: Query (Online, Every Request)

This phase runs every time a user asks a question.

💡 Key Insight: The entire magic of RAG happens in a single sentence: "Find the most relevant chunks from my data, paste them into the prompt alongside the user's question, and let the LLM answer using that context." That is it. Everything else is optimization.

Step 1: Embed the query. The user's question gets converted into a vector using the same embedding model used during ingestion.

Step 2: Retrieve relevant chunks. The vector database performs a similarity search: it finds the top K chunks (typically 3 to 5) whose vectors are closest to the query vector. "Closest" means most semantically similar, not keyword matched. Asking "how do we handle authentication?" will match a chunk about "user login flow with JWT tokens" even though the words are different.

Step 3: Augment the prompt. The retrieved chunks are injected into the LLM's prompt as context, alongside the original question. The prompt looks something like:

You are a helpful assistant. Answer the question using ONLY the 
context provided below. If the context does not contain the answer, 
say "I don't know."

Context:
{chunk_1_text}
{chunk_2_text}
{chunk_3_text}

Question: {user_question}
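The template above can be filled in with a small helper. This is a sketch, not a fixed recipe: `build_prompt` is a hypothetical name, and numbering the chunks is an added assumption that makes it easier to trace an answer back to its sources.

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Answer the question using ONLY the
context provided below. If the context does not contain the answer,
say "I don't know."

Context:
{context}

Question: {question}"""


def build_prompt(question: str, chunks: list[str]) -> str:
    """Insert the retrieved chunks and the user's question into the template."""
    # Number each chunk so the answer can be traced back to a source
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question)


demo_prompt = build_prompt(
    "How do users log in?",
    ["Login uses JWT tokens.", "Tokens expire after 1 hour."],
)
```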

Step 4: Generate the answer. The LLM reads the context and the question, then generates an answer grounded in the retrieved data. Because the answer is based on real documents instead of the model's memory, hallucination drops substantially, although the model can still misread or ignore the context it was given.

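# Phase 2: Query (simplified)
# Continues from the ingestion snippet and reuses its `vectorstore`
from langchain_openai import ChatOpenAI

# The model name is only an example; use any chat model you have access to
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

query = "How does our authentication service work?"

# Step 1 and 2: Embed query, retrieve relevant chunks
results = vectorstore.similarity_search(query, k=3)
context = "\n\n".join(doc.page_content for doc in results)

# Step 3 and 4: Augment and generate
prompt = f"""Answer using ONLY this context. If the context does not contain the answer, say "I don't know."

Context:
{context}

Question: {query}"""

response = llm.invoke(prompt)
print(response.content)  # invoke() returns a message object; .content holds the text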

RAG vs Fine Tuning: When to Use Which

Factor	RAG	Fine Tuning
Data changes frequently	Yes (just reindex)	No (retrain every time)
Need source attribution	Yes (you know which chunks were used)	No
Cost	Lower (no GPU training)	Higher (GPU hours for training)
Setup complexity	Medium (vector DB + retriever)	Higher (training pipeline)
Hallucination reduction	Strong (grounded in retrieved docs)	Moderate (still generates from memory)
Best for	Q&A over docs, support bots, code search	Changing the model's tone, style, or domain vocabulary

The rule of thumb: If your data changes more than once a month, use RAG. If you need the model to think differently (not just know differently), fine tune. Most production systems use RAG first and add fine tuning later if needed.

Where RAG Breaks (and How to Fix It)

Naive RAG works for demos. In production it fails often, by a rough estimate around 40% of the time, and the usual cause is retrieval, not generation.

Problem	Why it happens	Fix
Wrong chunks retrieved	Semantic gap between query and document language	Hybrid search: combine vector similarity with keyword (BM25) search
Chunks too generic	Large chunks dilute the relevant information	Smaller chunks (200 to 300 tokens) with more overlap
Answer uses wrong context	Top K retrieval includes irrelevant results	Reranking: use a cross encoder model to rescore retrieved chunks before passing them to the LLM
Stale data	Documents changed but embeddings were not updated	Incremental indexing: re-embed only changed documents on a schedule
Complex multi-hop questions	Answer requires combining info from multiple documents	Agentic RAG: let an AI agent decompose the question, retrieve iteratively, and synthesize
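As one illustration of the hybrid search fix, here is a sketch of reciprocal rank fusion (RRF), a common way to merge a vector ranking with a keyword ranking. The article does not prescribe a fusion method, so treat RRF as one reasonable option. The document IDs are made up, and `k=60` is a commonly used default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers for the same query
vector_ranking = ["doc_a", "doc_b", "doc_c"]   # from vector similarity
keyword_ranking = ["doc_c", "doc_a", "doc_d"]  # from BM25 keyword search

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

Because RRF uses only ranks, it avoids having to put vector similarity scores and BM25 scores on a shared scale.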

The RAG Stack in 2026

Component	Popular choices
Embedding model	OpenAI `text-embedding-3-small`, Cohere Embed, BGE, Jina
Vector database	Pinecone, Weaviate, Qdrant, Chroma, pgvector (PostgreSQL)
Orchestration	LangChain, LlamaIndex, Haystack
LLM	Claude, GPT-4, Gemini, Llama, Mistral
Reranker	Cohere Rerank, cross encoder models

What This Means For Engineers

RAG is not an AI/ML skill. It is a backend engineering skill. The core of RAG is data ingestion, chunking, indexing, and retrieval. These are the same skills you use to build search engines and data pipelines. If you can build a REST API with a database, you can build a RAG pipeline.

Start with pgvector before reaching for a dedicated vector database. If you already use PostgreSQL, the pgvector extension adds vector similarity search to your existing database. No new infrastructure. No new ops burden. Scale to a dedicated vector DB only when pgvector becomes the bottleneck.

Retrieval quality matters more than model quality. As a rough estimate, swapping GPT-3.5 for GPT-4 improves RAG answers by maybe 10 to 15%, while improving your retrieval (better chunking, hybrid search, reranking) can improve them by 30 to 50%. Invest in retrieval first.
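Investing in retrieval is easier when you can measure it. Here is a minimal sketch of a hit rate metric, assuming you have a small hand-labeled set of questions with the chunk IDs that should be retrieved for each. The function name and the IDs are hypothetical.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int = 3) -> float:
    """Fraction of questions whose top-k results contain at least one relevant chunk."""
    if not retrieved:
        return 0.0
    hits = sum(
        1
        for results, gold in zip(retrieved, relevant)
        if gold & set(results[:k])  # any overlap between gold IDs and the top k
    )
    return hits / len(retrieved)


# Hypothetical evaluation set: three questions
retrieved_ids = [["c1", "c7", "c3"], ["c4", "c2", "c9"], ["c5", "c6", "c8"]]
relevant_ids = [{"c3"}, {"c1"}, {"c5", "c0"}]

score_at_3 = hit_rate_at_k(retrieved_ids, relevant_ids, k=3)
score_at_2 = hit_rate_at_k(retrieved_ids, relevant_ids, k=2)
```

Rerunning a metric like this after each change to chunking, hybrid search, or reranking shows whether the change helped before any LLM call is involved.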

Job Openings

Software Engineer I, Data Platform @Uber: Apply Here
Software Engineer II, Web/Frontend, Membership @Uber: Apply Here
Software Engineer II, U4B Platforms @Uber: Apply Here
Software Development Engineer, Device Management Systems @Amazon: Apply Here
Software Engineering AMTS (Batch 2026) @Salesforce: Apply Here

Follow me on YouTube · LinkedIn · X · Instagram to stay updated.

See you in the next one!
Signing Off, Scortier
