Added Job Openings at the end of the article!
Welcome to Grind Engineer, your guide to becoming a better software engineer! No fluff. Pure engineering insights.
ChatGPT knows nothing about your company's internal docs. Claude has never read your codebase. Gemini cannot search your private Notion workspace.
LLMs are trained on public internet data up to a cutoff date. After that, they are frozen. They cannot access your data, they cannot learn new facts, and when they do not know something, they make it up confidently. This is called hallucination.
RAG fixes this. Retrieval Augmented Generation is the architecture pattern that connects an LLM to YOUR data so it can answer questions about things it was never trained on, with answers grounded in retrieved text instead of guesses.

The Open Book Exam Analogy
Think of an LLM without RAG as a closed book exam. The student (the model) must answer everything from memory. If they do not remember, they guess. Sometimes the guess sounds convincing but is completely wrong.
RAG turns it into an open book exam. Before answering, the student looks up the relevant pages in the textbook, reads them, and then writes an answer based on what they just read. The student still needs intelligence to synthesize and reason, but they no longer need to memorize everything.
The "textbook" is your data: internal docs, PDFs, knowledge bases, codebases, Confluence pages, support tickets, product catalogs, anything.
How RAG Works: Two Phases
RAG has two distinct phases. One runs offline (once). The other runs online (every query).
Phase 1: Ingestion (Offline, Runs Once)
This phase prepares your data for fast retrieval.
Step 1: Load documents. Collect your raw data: PDFs, Word files, web pages, database records, markdown files, Slack messages. Anything the LLM should know about.
Step 2: Chunk the documents. Split each document into smaller pieces, typically 300 to 600 tokens each with a small overlap between chunks. Why? Because LLMs have context window limits and vector search works better on focused passages than entire documents.
Step 3: Create embeddings. Pass each chunk through an embedding model (like OpenAI's text-embedding-3-small or open source alternatives like bge-large). The model converts each text chunk into a numerical vector, a list of numbers that represents the meaning of that text. Similar meanings produce similar vectors.
Step 4: Store in a vector database. Save all the vectors and their original text chunks into a vector database (Pinecone, Weaviate, Qdrant, Chroma, or PostgreSQL with pgvector). This database is optimized for one operation: finding the vectors most similar to a given query vector.
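To make Step 2 concrete, here is a minimal sketch of chunking with overlap in plain Python. It assumes whitespace-separated words stand in for model tokens, which is only an approximation of a real tokenizer. The names chunk_tokens and chunk_text are hypothetical helpers for illustration, not part of any library.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def chunk_text(text, chunk_size=500, overlap=50):
    # Assumption: whitespace-separated words approximate model tokens.
    words = text.split()
    return [" ".join(c) for c in chunk_tokens(words, chunk_size, overlap)]


# Tiny example so the overlap is visible: the word "c" appears in both chunks.
example_chunks = chunk_text("a b c d e", chunk_size=3, overlap=1)
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk.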
# Phase 1: Ingestion (simplified)
# Imports use the current split LangChain packages (langchain-community,
# langchain-text-splitters, langchain-openai, langchain-chroma).
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# Step 1: Load documents
loader = DirectoryLoader("./docs", glob="**/*.md")
docs = loader.load()
# Step 2: Chunk (chunk_size and chunk_overlap count characters here, not tokens)
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# Step 3 and 4: Embed and store (needs OPENAI_API_KEY in the environment)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")
Phase 2: Query (Online, Every Request)
This phase runs every time a user asks a question.
💡 Key Insight: The entire magic of RAG happens in a single sentence: "Find the most relevant chunks from my data, paste them into the prompt alongside the user's question, and let the LLM answer using that context." That is it. Everything else is optimization.
Step 1: Embed the query. The user's question gets converted into a vector using the same embedding model used during ingestion.
Step 2: Retrieve relevant chunks. The vector database performs a similarity search: it finds the top K chunks (typically 3 to 5) whose vectors are closest to the query vector. "Closest" means most semantically similar, not keyword matched. Asking "how do we handle authentication?" will match a chunk about "user login flow with JWT tokens" even though the words are different.
Step 3: Augment the prompt. The retrieved chunks are injected into the LLM's prompt as context, alongside the original question. The prompt looks something like:
You are a helpful assistant. Answer the question using ONLY the
context provided below. If the context does not contain the answer,
say "I don't know."
Context:
{chunk_1_text}
{chunk_2_text}
{chunk_3_text}
Question: {user_question}
Step 4: Generate the answer. The LLM reads the context and the question, then generates an answer grounded in the retrieved data. Because the answer is based on real documents, hallucination drops dramatically.
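The retrieve and augment steps can be sketched without any framework. This is a toy version under loud assumptions: the 3-dimensional vectors are made up by hand (real embeddings have hundreds or thousands of dimensions), and cosine_similarity, top_k, and build_prompt are hypothetical helper names. A vector database does the same ranking, just with an index that avoids scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=3):
    """index is a list of (chunk_text, vector) pairs. Returns the k closest chunk texts."""
    scored = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]


def build_prompt(question, chunks):
    """Paste the retrieved chunks into the prompt next to the question."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using ONLY the context provided below. "
        'If the context does not contain the answer, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hand-made vectors for illustration only; an embedding model would produce these.
index = [
    ("User login flow with JWT tokens", [0.9, 0.1, 0.0]),
    ("Quarterly revenue report", [0.0, 0.2, 0.9]),
    ("Password reset and session expiry", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded question
chunks = top_k(query_vec, index, k=2)
prompt = build_prompt("How do we handle authentication?", chunks)
```

With these toy vectors, the two authentication chunks rank above the revenue report, so only they end up in the prompt.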
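# Phase 2: Query (simplified)
from langchain_openai import ChatOpenAI
# Assumption: any chat model works here; the model name is only an example
llm = ChatOpenAI(model="gpt-4o-mini")
query = "How does our authentication service work?"
# Step 1 and 2: Embed query, retrieve relevant chunks
# (vectorstore is the Chroma store built in Phase 1)
results = vectorstore.similarity_search(query, k=3)
context = "\n\n".join(doc.page_content for doc in results)
# Step 3 and 4: Augment and generate
prompt = f"""Answer using ONLY this context:
{context}
Question: {query}"""
response = llm.invoke(prompt)
print(response.content)
RAG vs Fine Tuning: When to Use Which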
| Factor | RAG | Fine Tuning |
|---|---|---|
| Data changes frequently | Yes (just re-index) | No (retrain every time) |
| Need source attribution | Yes (you know which chunks were used) | No |
| Cost | Lower (no GPU training) | Higher (GPU hours for training) |
| Setup complexity | Medium (vector DB + retriever) | Higher (training pipeline) |
| Hallucination reduction | Strong (grounded in retrieved docs) | Moderate (still generates from memory) |
| Best for | Q&A over docs, support bots, code search | Changing the model's tone, style, or domain vocabulary |
The rule of thumb: If your data changes more than once a month, use RAG. If you need the model to think differently (not just know differently), fine tune. Most production systems use RAG first and add fine tuning later if needed.
Where RAG Breaks (and How to Fix It)
Naive RAG works for demos. It fails in production roughly 40% of the time because retrieval is the bottleneck, not generation.
| Problem | Why it happens | Fix |
|---|---|---|
| Wrong chunks retrieved | Semantic gap between query and document language | Hybrid search: combine vector similarity with keyword (BM25) search |
| Chunks too generic | Large chunks dilute the relevant information | Smaller chunks (200 to 300 tokens) with more overlap |
| Answer uses wrong context | Top K retrieval includes irrelevant results | Reranking: use a cross-encoder model to re-score retrieved chunks before passing them to the LLM |
| Stale data | Documents changed but embeddings were not updated | Incremental indexing: re-embed only changed documents on a schedule |
| Complex multi-hop questions | Answer requires combining info from multiple documents | Agentic RAG: let an AI agent decompose the question, retrieve iteratively, and synthesize |
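Hybrid search needs some way to merge a vector ranking and a keyword ranking into one list. One common option is Reciprocal Rank Fusion (RRF). The article does not name a fusion method, so treat this as one possible sketch rather than the prescribed fix. The document ids are made up, reciprocal_rank_fusion is a hypothetical helper name, and k=60 is just the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up results: one list from vector similarity, one from keyword (BM25) search.
vector_hits = ["doc_auth", "doc_sessions", "doc_billing"]
keyword_hits = ["doc_jwt", "doc_auth", "doc_sessions"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only looks at ranks, not raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales.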
The RAG Stack in 2026
| Component | Popular choices |
|---|---|
| Embedding model | OpenAI text-embedding-3-small, bge-large (open source) |
| Vector database | Pinecone, Weaviate, Qdrant, Chroma, pgvector (PostgreSQL) |
| Orchestration | LangChain, LlamaIndex, Haystack |
| LLM | Claude, GPT-4, Gemini, Llama, Mistral |
| Reranker | Cohere Rerank, cross-encoder models |
What This Means For Engineers
RAG is not an AI/ML skill. It is a backend engineering skill. The core of RAG is data ingestion, chunking, indexing, and retrieval. These are the same skills you use to build search engines and data pipelines. If you can build a REST API with a database, you can build a RAG pipeline.
Start with pgvector before reaching for a dedicated vector database. If you already use PostgreSQL, the pgvector extension adds vector similarity search to your existing database. No new infrastructure. No new ops burden. Scale to a dedicated vector DB only when pgvector becomes the bottleneck.
Retrieval quality matters more than model quality. Swapping GPT-3.5 for GPT-4 improves RAG answers by maybe 10 to 15%. Improving your retrieval (better chunking, hybrid search, reranking) improves answers by 30 to 50%. Invest in retrieval first.
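If retrieval is where the gains are, it helps to measure it on its own before touching the model. A minimal sketch, assuming you have hand-labeled which chunk ids are relevant for a few test questions. The chunk ids and questions below are invented, and hit_rate_at_k is a hypothetical helper name.

```python
def hit_rate_at_k(retrieved, relevant, k=3):
    """Fraction of queries whose top-k results contain at least one relevant chunk id.

    retrieved: dict mapping query -> ranked list of chunk ids from the retriever
    relevant:  dict mapping query -> set of chunk ids labeled relevant by a human
    """
    if not retrieved:
        return 0.0
    hits = 0
    for query, ranked_ids in retrieved.items():
        if set(ranked_ids[:k]) & relevant.get(query, set()):
            hits += 1
    return hits / len(retrieved)


# Invented evaluation data for illustration.
retrieved = {
    "how do we handle authentication?": ["c7", "c2", "c9"],
    "what is the refund policy?": ["c4", "c1", "c8"],
}
relevant = {
    "how do we handle authentication?": {"c2"},
    "what is the refund policy?": {"c5"},
}
score = hit_rate_at_k(retrieved, relevant, k=3)
```

Run the same labeled questions before and after a chunking or reranking change, and you can see whether retrieval actually improved without paying for a single LLM call.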
Job Openings
Software Engineer I, Data Platform @Uber: Apply Here
Software Engineer II, Web/Frontend, Membership @Uber: Apply Here
Software Engineer II, U4B Platforms @Uber: Apply Here
Software Development Engineer, Device Management Systems @Amazon: Apply Here
Software Engineering AMTS (Batch 2026) @Salesforce: Apply Here
See you in the next one!
Signing Off, Scortier



