Added Job Openings at the end of the article!

Welcome to Grind Engineer, your guide to becoming a better software engineer! No fluff. Pure engineering insights.

ChatGPT knows nothing about your company's internal docs. Claude has never read your codebase. Gemini cannot search your private Notion workspace.

LLMs are trained on public internet data up to a cutoff date. After that, they are frozen. They cannot access your data, they cannot learn new facts, and when they do not know something, they make it up confidently. This is called hallucination.

RAG fixes this. Retrieval Augmented Generation is the architecture pattern that connects an LLM to YOUR data so it can answer questions about things it was never trained on, with far fewer hallucinations.

The Open Book Exam Analogy

Think of an LLM without RAG as a closed book exam. The student (the model) must answer everything from memory. If they do not remember, they guess. Sometimes the guess sounds convincing but is completely wrong.

RAG turns it into an open book exam. Before answering, the student looks up the relevant pages in the textbook, reads them, and then writes an answer based on what they just read. The student still needs intelligence to synthesize and reason, but they no longer need to memorize everything.

The "textbook" is your data: internal docs, PDFs, knowledge bases, codebases, Confluence pages, support tickets, product catalogs, anything.

How RAG Works: Two Phases

RAG has two distinct phases. One runs offline: once up front, and again whenever your data changes. The other runs online, on every query.

Phase 1: Ingestion (Offline, Runs Once)

This phase prepares your data for fast retrieval.

Step 1: Load documents. Collect your raw data: PDFs, Word files, web pages, database records, markdown files, Slack messages. Anything the LLM should know about.

Step 2: Chunk the documents. Split each document into smaller pieces, typically 300 to 600 tokens each with a small overlap between chunks. Why? Because LLMs have context window limits and vector search works better on focused passages than entire documents.
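
To make the overlap idea concrete, here is a minimal sketch of a sliding window chunker. It assumes whitespace splitting as a stand-in for a real tokenizer, and the function name chunk_tokens is made up for this example.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer in this sketch.
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.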

Step 3: Create embeddings. Pass each chunk through an embedding model (like OpenAI's text-embedding-3-small or open source alternatives like bge-large). The model converts each text chunk into a numerical vector, a list of numbers that represents the meaning of that text. Similar meanings produce similar vectors.
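
"Similar meanings produce similar vectors" is usually measured with cosine similarity. The sketch below uses tiny hand-made 3-dimensional vectors as an assumption for illustration; real embedding models return hundreds or thousands of dimensions, and the values here are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, hand-written for illustration only.
auth_question = [0.8, 0.2, 0.1]  # "how do we handle authentication?"
login_doc = [0.9, 0.1, 0.0]      # "user login flow with JWT tokens"
pricing_doc = [0.0, 0.2, 0.9]    # "pricing tiers and billing cycles"

print(cosine_similarity(auth_question, login_doc))    # high: related meaning
print(cosine_similarity(auth_question, pricing_doc))  # low: unrelated meaning
```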

Step 4: Store in a vector database. Save all the vectors and their original text chunks into a vector database (Pinecone, Weaviate, Qdrant, Chroma, or PostgreSQL with pgvector). This database is optimized for one operation: finding the vectors most similar to a given query vector.
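
Conceptually, that one operation looks like the toy store below. This is a sketch under assumptions, not how any of the listed databases is implemented: the class name InMemoryVectorStore is hypothetical, the vectors are invented, and it scans every row, whereas production databases typically use approximate nearest neighbor indexes to avoid a full scan.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


@dataclass
class InMemoryVectorStore:
    """Hypothetical toy store: keeps (vector, text) pairs and scans all of them."""
    rows: list = field(default_factory=list)

    def add(self, vector, text):
        self.rows.append((vector, text))

    def search(self, query_vector, k=3):
        # Score every stored chunk against the query, then keep the k best.
        scored = [(cosine_similarity(query_vector, vec), text) for vec, text in self.rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore()
store.add([0.9, 0.1, 0.0], "User login flow with JWT tokens")
store.add([0.0, 0.2, 0.9], "Pricing tiers and billing cycles")
store.add([0.6, 0.4, 0.2], "Password reset procedure")

for score, text in store.search([0.8, 0.2, 0.1], k=2):
    print(f"{score:.3f}  {text}")
```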

# Phase 1: Ingestion (simplified)
# Requires: pip install langchain-community langchain-text-splitters langchain-openai langchain-chroma
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Step 1: Load documents (TextLoader reads each markdown file as plain text)
loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=TextLoader)
docs = loader.load()

# Step 2: Chunk (this splitter measures chunk_size and chunk_overlap in characters, not tokens)
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Step 3 and 4: Embed and store (expects OPENAI_API_KEY in the environment)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")

Phase 2: Query (Online, Every Request)

This phase runs every time a user asks a question.

💡 Key Insight: The entire magic of RAG happens in a single sentence: "Find the most relevant chunks from my data, paste them into the prompt alongside the user's question, and let the LLM answer using that context." That is it. Everything else is optimization.

Step 1: Embed the query. The user's question gets converted into a vector using the same embedding model used during ingestion.

Step 2: Retrieve relevant chunks. The vector database performs a similarity search: it finds the top K chunks (typically 3 to 5) whose vectors are closest to the query vector. "Closest" means most semantically similar, not keyword matched. Asking "how do we handle authentication?" will match a chunk about "user login flow with JWT tokens" even though the words are different.

Step 3: Augment the prompt. The retrieved chunks are injected into the LLM's prompt as context, alongside the original question. The prompt looks something like:

You are a helpful assistant. Answer the question using ONLY the 
context provided below. If the context does not contain the answer, 
say "I don't know."

Context:
{chunk_1_text}
{chunk_2_text}
{chunk_3_text}

Question: {user_question}
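
Filling that template is plain string formatting. Here is a minimal sketch; the build_prompt helper, the [1], [2] numbering of chunks, and the sample chunk texts are assumptions added for illustration, not part of any library.

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Answer the question using ONLY the
context provided below. If the context does not contain the answer,
say "I don't know."

Context:
{context}

Question: {question}"""


def build_prompt(chunks, question):
    # Number each chunk so it is clear where one retrieved passage ends
    # and the next begins (an assumption of this sketch).
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Hypothetical retrieved chunks.
prompt = build_prompt(
    ["Login issues a JWT that expires after 15 minutes.",
     "Refresh tokens are stored in an HTTP-only cookie."],
    "How do we handle authentication?",
)
print(prompt)
```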

Step 4: Generate the answer. The LLM reads the context and the question, then generates an answer grounded in the retrieved data. Because the answer is based on real documents, hallucination drops dramatically.

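# Phase 2: Query (simplified)
from langchain_openai import ChatOpenAI

# Any chat model works here; this model name is only an example
llm = ChatOpenAI(model="gpt-4o-mini")

query = "How does our authentication service work?"

# Step 1 and 2: Embed query, retrieve relevant chunks
# (vectorstore is the Chroma store built during ingestion)
results = vectorstore.similarity_search(query, k=3)
context = "\n\n".join(doc.page_content for doc in results)

# Step 3 and 4: Augment and generate
prompt = f"""Answer using ONLY this context:
{context}

Question: {query}"""

response = llm.invoke(prompt)
print(response.content)  # invoke() returns a message object; the answer text is in .content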

RAG vs Fine Tuning: When to Use Which

| Factor | RAG | Fine Tuning |
| --- | --- | --- |
| Data changes frequently | Yes (just re index) | No (retrain every time) |
| Need source attribution | Yes (you know which chunks were used) | No |
| Cost | Lower (no GPU training) | Higher (GPU hours for training) |
| Setup complexity | Medium (vector DB + retriever) | Higher (training pipeline) |
| Hallucination reduction | Strong (grounded in retrieved docs) | Moderate (still generates from memory) |
| Best for | Q&A over docs, support bots, code search | Changing the model's tone, style, or domain vocabulary |

The rule of thumb: If your data changes more than once a month, use RAG. If you need the model to think differently (not just know differently), fine tune. Most production systems use RAG first and add fine tuning later if needed.

Where RAG Breaks (and How to Fix It)

Naive RAG works for demos. It fails in production roughly 40% of the time because retrieval is the bottleneck, not generation.

| Problem | Why it happens | Fix |
| --- | --- | --- |
| Wrong chunks retrieved | Semantic gap between query and document language | Hybrid search: combine vector similarity with keyword (BM25) search |
| Chunks too generic | Large chunks dilute the relevant information | Smaller chunks (200 to 300 tokens) with more overlap |
| Answer uses wrong context | Top K retrieval includes irrelevant results | Reranking: use a cross encoder model to re score retrieved chunks before passing to LLM |
| Stale data | Documents changed but embeddings were not updated | Incremental indexing: re embed only changed documents on a schedule |
| Complex multi hop questions | Answer requires combining info from multiple documents | Agentic RAG: let an AI agent decompose the question, retrieve iteratively, and synthesize |
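
For the hybrid search fix, one common way to combine a vector ranking with a keyword ranking is Reciprocal Rank Fusion (RRF). RRF is a swapped-in choice for this sketch, since the article does not say how to merge the two result lists. The chunk IDs are hypothetical, and k=60 is just a commonly used default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one list, best first.

    Each list contributes 1 / (k + rank) per chunk, so chunks that rank
    well in more than one list rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for the same query from two retrievers.
vector_hits = ["doc_auth", "doc_sessions", "doc_billing"]
keyword_hits = ["doc_jwt", "doc_auth", "doc_sessions"]

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

RRF only looks at rank positions, so you do not have to normalize BM25 scores against cosine similarities before merging.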

The RAG Stack in 2026

| Component | Popular choices |
| --- | --- |
| Embedding model | OpenAI text-embedding-3-small, Cohere Embed, BGE, Jina |
| Vector database | Pinecone, Weaviate, Qdrant, Chroma, pgvector (PostgreSQL) |
| Orchestration | LangChain, LlamaIndex, Haystack |
| LLM | Claude, GPT 4, Gemini, Llama, Mistral |
| Reranker | Cohere Rerank, cross encoder models |

What This Means For Engineers

  1. RAG is not an AI/ML skill. It is a backend engineering skill. The core of RAG is data ingestion, chunking, indexing, and retrieval. These are the same skills you use to build search engines and data pipelines. If you can build a REST API with a database, you can build a RAG pipeline.

  2. Start with pgvector before reaching for a dedicated vector database. If you already use PostgreSQL, the pgvector extension adds vector similarity search to your existing database. No new infrastructure. No new ops burden. Scale to a dedicated vector DB only when pgvector becomes the bottleneck.

  3. Retrieval quality matters more than model quality. Swapping GPT 3.5 for GPT 4 improves RAG answers by maybe 10 to 15%. Improving your retrieval (better chunking, hybrid search, reranking) improves answers by 30 to 50%. Invest in retrieval first.
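
To show what point 2 looks like in practice, here is a rough sketch of a pgvector similarity query driven from Python. The table name chunks, its columns, and both helper functions are assumptions for illustration. The 1536 dimension matches text-embedding-3-small, and conn is expected to be a DB-API connection (for example from psycopg) to a PostgreSQL database with pgvector installed.

```python
# One-time setup SQL (hypothetical table and column names).
CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    content text NOT NULL,
    embedding vector(1536)  -- must match the embedding model's dimension
);
"""

# <=> is pgvector's cosine distance operator: smaller means more similar.
SEARCH_SQL = """
SELECT content
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(values):
    """Format a list of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


def search_chunks(conn, query_embedding, k=5):
    """Return the k chunk texts nearest to query_embedding."""
    cur = conn.cursor()
    try:
        cur.execute(SEARCH_SQL, (to_vector_literal(query_embedding), k))
        return [row[0] for row in cur.fetchall()]
    finally:
        cur.close()
```

The query embedding has to come from the same model that produced the stored embeddings, just as in the query phase described earlier.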

Job Openings

  • Software Engineer I, Data Platform @Uber: Apply Here

  • Software Engineer II, Web/Frontend, Membership @Uber: Apply Here

  • Software Engineer II, U4B Platforms @Uber: Apply Here

  • Software Development Engineer, Device Management Systems @Amazon: Apply Here

  • Software Engineering AMTS (Batch 2026) @Salesforce: Apply Here

Follow me on YouTube · LinkedIn · X · Instagram to stay updated.

See you in the next one!
Signing Off, Scortier
