Table of Contents
- 1. What Is RAG --- Retrieval-Augmented Generation
- 2. Why You Need RAG --- 3 Limits of LLMs Alone
- 3. How It Works --- RAG in 3 Steps
- 4. The Key Components of a RAG System
- 5. What Is a Vector Database?
- 6. Where RAG Is Actually Used
- 7. RAG vs Fine-Tuning --- Which Should You Pick?
- 8. How to Build It --- A RAG with LangChain
- 9. RAG Challenges and How to Handle Them
- 10. Major Tools and Services
- FAQ
"I want to load our employee handbook into ChatGPT and have it answer questions from staff automatically." "I need it to search the latest research database and summarize papers." --- demands like these have been growing fast. But ChatGPT's training data is frozen at some point in the past, and you obviously can't just hand confidential internal documents over to an AI for training.
The technique that solves this problem is RAG (Retrieval-Augmented Generation). Since 2023, it has become one of the most important keywords in enterprise AI, and ChatGPT's "Custom GPTs" and "Projects" features actually use RAG under the hood.
This article breaks down RAG in three diagrammed steps and walks through vector databases, a LangChain implementation, and when to choose RAG over fine-tuning --- at a level a beginner can follow, but technically accurate enough to be useful.
1. What Is RAG --- Retrieval-Augmented Generation
RAG (Retrieval-Augmented Generation) literally means "generation that has been augmented by retrieval."
In one sentence: it's a system where, "before the LLM (large language model) produces an answer, it searches an external database for relevant information and uses those search results to inform its response."
A Cooking Analogy
An LLM on its own is "a chef who cooks from memory." Talented, sure, but they can't make a dish they don't know, and they have no idea what's in your fridge.
RAG is the system that "hands the chef a cookbook and tells them what's in the fridge before they start cooking." Now the chef can flip through the recipes and use the ingredients on hand to put together the right dish.
What "Retrieval," "Augmented," and "Generation" Each Do
| Word | Meaning | Role in RAG |
|---|---|---|
| Retrieval | Search / fetch | Pull documents related to the question out of a database |
| Augmented | Extended / enhanced | Add the retrieved information to the prompt that goes to the LLM |
| Generation | Generate | The LLM produces an answer with the search results in front of it |
The key idea: instead of retraining the LLM itself, you hand it the "knowledge it needs" from the outside, every time a question comes in. That's the fundamental difference between RAG and fine-tuning, which we'll cover later.
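The retrieve-augment-generate loop above can be sketched in a few lines of plain Python. Everything here is illustrative: the `KB` dictionary stands in for a real vector database, and `fake_llm` stands in for a real model call, so the sketch only shows the shape of the pipeline, not a production implementation.

```python
# Toy sketch of the RAG idea: knowledge lives OUTSIDE the model and is
# injected into the prompt at question time. All names here (KB, fake_llm)
# are made up for illustration.

KB = {
    "leave": "Article 15: Employees receive 10 days of paid leave after 6 months.",
    "expenses": "Article 22: Expenses above $50 require manager approval.",
}

def retrieve(question: str) -> str:
    """Naive retrieval: return the KB entry whose key appears in the question."""
    for key, text in KB.items():
        if key in question.lower():
            return text
    return ""

def augment(question: str, context: str) -> str:
    """Augment: splice the retrieved text into the prompt."""
    return f"Use the following information to answer.\n[Reference]\n{context}\n[Question]\n{question}"

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; it just echoes the reference it was given."""
    return prompt.split("[Reference]\n")[1].split("\n[Question]")[0]

def rag_answer(question: str) -> str:
    return fake_llm(augment(question, retrieve(question)))

print(rag_answer("How many days of paid leave do I get?"))
```

Note that nothing about the model changes between questions; only the retrieved context does. That is the whole contrast with fine-tuning.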
2. Why You Need RAG --- 3 Limits of LLMs Alone
There are three problems that ChatGPT, Claude, and other LLMs can't solve on their own.
Limit 1: The Knowledge Cutoff (How Fresh the Information Is)
LLMs are trained on "data up to a certain point," so they don't know anything that happened after training. GPT-4 Turbo, for instance, only had information up to April 2023.
- "Tell me about the new product announced yesterday." -> Can't answer
- "What was in the legislation that passed last week?" -> Can't answer
- "What's today's exchange rate?" -> Can't answer
RAG lets you pull from the latest news, databases, and APIs to actually respond.
Limit 2: Hallucinations (Plausible-Sounding Lies)
When you ask an LLM about something it doesn't know, it has a strong tendency to make up an answer that looks reasonable. This is called hallucination.
Example: ask "How many days of paid leave does our company give?" and the LLM, which has no idea, will reply with something like "typically 10 to 20 days." That's worse than useless in a business context.
With RAG, the system searches the actual employee handbook and uses it as a reference, which gives you answers grounded in real evidence. You can even attach the citation --- "this is in document X, page Y."
Limit 3: No Access to Internal or Private Data
An LLM's training data does not include your company's manuals, contracts, or customer records. And you can't just hand confidential information over for training (data leakage risk, cost, and so on).
With RAG, you store internal documents in your own vector DB and only pull out the relevant chunks when a question comes in --- so you can tap into internal data while keeping security intact.
3. How It Works --- RAG in 3 Steps
RAG operates in two broad phases: "preparation (indexing)" and "runtime (Q&A)."
Preparation Phase --- Vectorize and Store Your Documents
- Collect documents: gather the PDFs, Word files, HTML, Markdown, or whatever else you want to use
- Chunk them: split the documents into reasonable lengths (say, 500-1000 characters)
- Embed them: run each chunk through an embedding model (e.g., OpenAI text-embedding-3-small) to convert it into a vector --- an array of numbers, often 1536-dimensional
- Store in a vector DB: save the chunks alongside their vectors in a dedicated database (Pinecone, Qdrant, etc.)
You run this whenever new documents are added or existing ones get updated.
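The chunking step is simple enough to sketch directly. This is a minimal fixed-size splitter with overlap, written from the description above; real pipelines usually use a library splitter (like the `RecursiveCharacterTextSplitter` shown later in this article) that also respects paragraph and sentence boundaries.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap, so context isn't cut off at hard
    boundaries: the tail of each chunk repeats at the head of the next."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk produced here would then go through the embedding model and into the vector DB alongside its source metadata.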
Runtime Phase --- 3 Steps to Answer a Question
Here's what happens when a user asks something:
- Step 1: Retrieval
- Vectorize the question using the same embedding model
- Pull the top K chunks (usually 3-10) "closest" to the question vector from the vector DB
- Closeness is typically measured with cosine similarity
- Step 2: Augmented
- Embed the retrieved chunks into the prompt as "reference information"
- Something like "Use the following information to answer the question: [search results] Question: [user's question]"
- Step 3: Generation
- The LLM (GPT-4, Claude, Gemini, etc.) generates the answer with the reference material in hand
- Add citations for "which document this came from" as needed
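Step 1's "closeness" measure is worth seeing concretely. Below is cosine similarity and a brute-force top-K search written out by hand; a real vector DB does the same ranking, just with approximate nearest-neighbor indexes so it stays fast at millions of vectors.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k chunks closest to the query vector."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

The indices returned by `top_k` map back to stored chunks, which is exactly what the vector DB hands to Step 2.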
Concrete Example: Asking ChatGPT About Your Company's Leave Policy
The flow for a question like "How many days of paid leave do I get?":
- The question gets vectorized by the embedding model -> [0.12, -0.45, 0.78, ...]
- The vector DB returns 3 chunks related to "leave" and "paid time off"
- Retrieved chunks: "Article 15: Annual Paid Leave. Employees with 6 months of service receive 10 days...", "Up to 20 days based on tenure...", etc.
- Prompt assembly: "Reference: Article 15... Question: How many days of paid leave do I get?"
- The LLM answers: "10 days at 6 months of service, up to 20 days based on tenure (see Employment Rules, Article 15)"
4. The Key Components of a RAG System
Let's walk through the five pieces that make up a RAG.
1. Embedding Model
An AI model that turns text into a numerical vector. It's trained so that "semantically similar texts end up close together in vector space."
| Model | Provider | Notes |
|---|---|---|
| text-embedding-3-small | OpenAI | Cheap and capable, 1536 dimensions |
| text-embedding-3-large | OpenAI | Higher accuracy, 3072 dimensions |
| voyage-3 | Voyage AI | Recommended by Anthropic, high accuracy |
| Cohere Embed v3 | Cohere | Multilingual, strong for non-English including Japanese |
| multilingual-e5-large | Microsoft (OSS) | Runs locally, free |
| BGE-M3 | BAAI (OSS) | 100+ languages, top-tier OSS model |
2. Vector Database
A specialized DB that stores massive numbers of vectors and quickly finds "nearby" ones. We dive into this in the next section.
3. Retriever
The component that, given a question, fetches the most relevant chunks. Plain vector search is the baseline; in practice it's often combined with keyword search (BM25 and friends) into hybrid search.
4. LLM (the Generator)
The large language model that produces the final answer --- GPT-4, Claude, Gemini, Llama 3, and so on. Works with both commercial APIs and self-hosted OSS models.
5. Prompt Template
The template that combines search results and the user's question into a single message for the LLM. A surprisingly important piece for RAG accuracy.
You are an assistant who knows our internal policies inside out.
Answer the question using only the reference information below.
If the answer isn't in the reference, reply with "I don't have that information."
[Reference]
{retrieved_chunks}
[Question]
{user_question}
[Answer]
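Filling that template is ordinary string formatting. A minimal sketch, using Python's `str.format` with the same placeholder names as the template above:

```python
# The same template as above, as a Python string with named placeholders.
TEMPLATE = """You are an assistant who knows our internal policies inside out.
Answer the question using only the reference information below.
If the answer isn't in the reference, reply with "I don't have that information."

[Reference]
{retrieved_chunks}

[Question]
{user_question}

[Answer]"""

def build_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and splice them into the template."""
    return TEMPLATE.format(
        retrieved_chunks="\n\n".join(chunks),
        user_question=question,
    )
```

Small wording changes here ("only the reference information below", the explicit fallback answer) are exactly the levers that reduce hallucinations, which is why the template matters more than it looks.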
5. What Is a Vector Database?
Unlike a regular RDBMS (MySQL and friends), a vector DB is purpose-built to "find the nearest neighbors --- the most similar vectors --- in high-dimensional vector space, fast."
Major Vector DBs Compared
| DB | Type | Notes | Pricing |
|---|---|---|---|
| Pinecone | Managed SaaS | Industry standard, dead simple to set up | Free tier, $70/mo+ |
| Weaviate | OSS + cloud | GraphQL API, hybrid search | OSS free, SaaS $25+ |
| Qdrant | OSS + cloud | Built in Rust, very fast, strong filtering | OSS free, free SaaS tier |
| Chroma | OSS | Lightweight, instantly usable from Python | Free (self-hosted) |
| pgvector | PostgreSQL extension | Use it with your existing PostgreSQL | Free (OSS extension) |
| Milvus | OSS + cloud | Scales to billions of vectors | OSS free, Zilliz Cloud |
| Elasticsearch | Search engine | Vector search support, integrates with existing ops | OSS free, managed available |
| Vertex AI Vector Search | Google Cloud | Tight integration with the GCP ecosystem | Pay-as-you-go |
Which One Should You Pick?
- Just want to try it: Chroma (works the moment you pip install)
- Already running PostgreSQL: pgvector (one DB to rule them all)
- Production with minimal ops: Pinecone (no setup needed)
- Serious OSS deployment: Qdrant or Weaviate
- Hundreds of millions to billions of records: Milvus
For more on picking where to host all this, take a look at PaaS (Vercel, etc.) vs Shared Hosting, VPS, and Cloud.
6. Where RAG Is Actually Used
Since 2023, RAG has become one of the most heavily adopted techniques in enterprise AI. Here are the patterns that show up most often.
Use Case 1: Internal Document Q&A (Knowledge Base)
RAG over employment rules, operating manuals, technical specs, meeting notes, and sales decks --- so employees can ask questions just like they'd ask ChatGPT. Microsoft 365 Copilot uses RAG against SharePoint documents in exactly the same way.
Use Case 2: Customer Support Automation
Build RAG over FAQs and support history, then automate first-touch responses with a chatbot. Human operators get to focus on the complex tickets.
Use Case 3: Specialist Q&A in Law and Medicine
RAG over case law databases, medical journals, treatment guidelines. The kind of system attorneys and physicians can lean on day to day. Because citations are explicit, this fits well in any field where evidence matters.
Use Case 4: Research Paper Search and Summarization
RAG over paper databases like arXiv, PubMed, and Google Scholar to answer questions like "what's the latest in this research area?" or "what are similar studies using approach X?" --- Elicit and Perplexity are well-known examples.
Use Case 5: E-Commerce Product Search and FAQ
RAG that integrates product manuals, reviews, and return policies. Now you can do natural-language search like "is this vacuum any good with pet hair?"
Use Case 6: Developer Documentation Chat
RAG over a library's official docs that answers questions like "I want to do X with AWS Lambda --- can you show me sample code?" Stripe, Vercel, Supabase, and others have shipped this.
Use Case 7: Internal Codebase Search and Explanation
RAG over GitHub code that powers tools for "show me how to use this function" or "what other files implement similar logic?" GitHub Copilot Chat and developer AIs like Cursor and Claude Code use RAG-style approaches under the hood.
Use Case 8: New AI-Optimization Standards Like llms.txt
The llms.txt standard for helping AIs reference web information correctly is a natural fit for RAG, letting site operators provide structured information they want AIs to read.
7. RAG vs Fine-Tuning --- Which Should You Pick?
The other approach to "give an LLM custom knowledge" that comes up alongside RAG is fine-tuning. The two take fundamentally different paths.
The Core Difference
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Approach | Hand information from outside at runtime | Retrain the model itself ahead of time |
| Updating knowledge | Just update the DB (instant) | Requires retraining (time, money) |
| Initial cost | Low (just stand up the DB) | High (training data prep + compute) |
| Operating cost | Search + LLM API calls | Inference only (own model) |
| Hallucinations | Low (sources are visible) | Medium (talks about what it learned) |
| Showing citations | Doable | Hard |
| Learning style/voice | Weak | Strong |
| Dynamic data | Strong (real-time data works) | Weak (needs retraining) |
| Confidential data | Can run fully on-prem | Also possible on-prem, but training infra is heavier |
When RAG Is the Right Pick
- Knowledge changes frequently (news, internal docs, product info)
- You need to show evidence for the answer (law, medicine, finance)
- You have huge volumes of documents (training all of it isn't realistic)
- You want to start now (shorter time to ship)
When Fine-Tuning Is the Right Pick
- You need a specific style/tone (brand voice, character)
- You want the model to learn specialty-domain language patterns (medical, legal)
- You want to lower inference cost (shorter prompts)
- You already have large volumes of labeled training data
Combining Both Is the Most Powerful Setup
In practice, RAG and fine-tuning are not competing techniques --- they combine. Use fine-tuning to teach the style; use RAG to feed the latest knowledge. That hybrid setup is common in real production systems.
That said, beginners should start with RAG. It's vastly easier to build and operate than fine-tuning.
8. How to Build It --- A RAG with LangChain
Here are the major frameworks, followed by a minimal Python example.
Major Frameworks
| Framework | Language | Notes |
|---|---|---|
| LangChain | Python / JS | Most widely adopted, huge integration library |
| LlamaIndex | Python | Specialized in data connections and indexing |
| Haystack | Python | Enterprise-oriented, fine-grained control |
| Semantic Kernel | C# / Python | Microsoft, strong .NET integration |
| DSPy | Python | Automates prompt optimization |
| Roll your own | Anything | Simple RAG can be 100 lines of code |
Minimal RAG with LangChain
Let's build a RAG that answers questions from an internal employment-rules PDF, in about 30 lines of LangChain.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
# 1. Load the document
loader = PyPDFLoader("rules.pdf")
docs = loader.load()
# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=500, chunk_overlap=50
)
chunks = splitter.split_documents(docs)
# 3. Embed + build the vector DB
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)
# 4. Build the RAG chain
llm = ChatOpenAI(model="gpt-4o-mini")
qa = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
return_source_documents=True,
)
# 5. Ask a question
result = qa.invoke({"query": "How many days of paid leave do I get?"})
print(result["result"])
print("Sources:", [d.metadata for d in result["source_documents"]])
Run this and it'll search the relevant section of the PDF and have GPT-4o-mini generate the answer. Because we're capturing source metadata too, you can serve responses to the user with citations like "see Article 15."
What a More Production-Ready Build Adds
- Better chunking (semantic splits, hierarchical chunks, etc.)
- Hybrid search (vector + BM25 keyword search)
- Reranking (Cohere Rerank, voyage-rerank, etc., to reorder search results)
- Query rewriting (HyDE, Multi-Query, and similar techniques to improve recall)
- Evaluation pipeline (automated evaluation with RAGAS)
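Of the additions above, hybrid search is the easiest to demystify. A common way to merge the vector ranking and the BM25 ranking is Reciprocal Rank Fusion (RRF); here is a small sketch of it. The constant `k = 60` is the value commonly used in the RRF literature, and the input is just two (or more) ranked lists of document IDs from whichever retrievers you run.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists (e.g. vector search + BM25) into one.

    Each document scores 1 / (k + rank) per list it appears in, so documents
    ranked high by multiple retrievers float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score calibration between the two retrievers (only ranks), which is why it's such a popular default for hybrid search.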
9. RAG Challenges and How to Handle Them
RAG is powerful, but in production you'll bump into the following.
Challenge 1: Chunking Is Hard
How you split your documents has a big impact on retrieval accuracy. Too short and you lose context; too long and search precision drops.
What to do:
- Semantic chunking (split on meaning boundaries)
- Use overlap (let adjacent chunks share some text)
- Hierarchical chunks (store as parent/child; search children, return parents)
Challenge 2: Retrieval Accuracy
You'll grab chunks that look relevant but aren't, and miss critical information.
What to do:
- Hybrid search (vector + BM25 keyword)
- Reranking models to reorder after retrieval
- Multi-query generation (search the same question phrased multiple ways)
Challenge 3: Context Length Limits
There's a cap on the number of tokens you can hand the LLM. You can't shove in unlimited chunks.
What to do:
- Tighten K (top 3-5 chunks)
- Summarize first, then pass
- Use long-context LLMs (Claude 200K, Gemini 1M, etc.)
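"Tighten K" in practice often means packing chunks into a token budget rather than a fixed count. A minimal sketch: keep top-ranked chunks until an estimated budget is spent. The 4-characters-per-token ratio is a rough English-text heuristic, not exact; real code would use a tokenizer like `tiktoken` to count precisely.

```python
def fit_context(chunks: list[str], max_tokens: int, tokens_per_char: float = 0.25) -> list[str]:
    """Greedily keep top-ranked chunks until the estimated token budget runs out.

    Assumes `chunks` is already sorted by relevance (best first) and that
    ~4 characters ≈ 1 token, which is a rough heuristic for English text.
    """
    kept, used = [], 0
    for chunk in chunks:
        cost = int(len(chunk) * tokens_per_char)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Because the chunks arrive relevance-sorted, cutting from the tail sacrifices the least useful context first.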
Challenge 4: Evaluation Is Hard
Measuring RAG answer quality objectively isn't easy. Building a ground-truth dataset is its own problem.
What to do:
- Use RAGAS (an OSS evaluation framework for RAG)
- Automate metrics like faithfulness, answer relevancy, and context precision/recall
- LLM-as-a-Judge (have a different LLM grade the output)
Challenge 5: Multilingual and Multimodal Content
Documents that mix Japanese and English, PDFs with images, tables and charts --- all tricky.
What to do:
- Use multilingual embedding models (BGE-M3, Cohere Multilingual)
- Pre-extract text from images and tables with an LLM (OCR + VLM)
- Multimodal embeddings (CLIP, Nomic, etc.)
10. Major Tools and Services
Here are the main tools you'd reach for, grouped by category.
Frameworks and Libraries
- LangChain --- the most widely used RAG framework
- LlamaIndex --- focused on data connections
- Haystack --- enterprise-oriented
- DSPy --- automatic prompt optimization
Vector DBs (Managed)
- Pinecone --- industry standard
- Weaviate Cloud --- GraphQL support
- Qdrant Cloud --- high performance
- Zilliz Cloud --- managed Milvus
Vector DBs (OSS / Self-Hosted)
- Chroma --- lightweight, immediate Python use
- Qdrant --- Rust-based, very fast
- Weaviate --- OSS edition
- Milvus --- for very large scale
- pgvector --- PostgreSQL extension
Embedding Models
- OpenAI text-embedding-3 --- the default, cheap
- Voyage AI --- recommended by Anthropic
- Cohere Embed v3 --- multilingual
- BGE-M3 --- top OSS performance
No-Code and Managed RAG Services
- ChatGPT Projects / Custom GPTs --- OpenAI's built-in RAG
- Claude Projects --- Anthropic's built-in RAG
- Notion AI --- search across documents inside Notion
- Microsoft Copilot (Microsoft 365) --- cross-document search across SharePoint and Teams
- Dify --- OSS no-code AI building platform
- Vertex AI Agent Builder --- Google Cloud's RAG-building service
- Amazon Bedrock Knowledge Bases --- AWS managed RAG
Evaluation Tools
- RAGAS --- OSS RAG evaluation framework
- TruLens --- general LLM app evaluation
- LangSmith --- official LangChain tracing and evaluation
FAQ
Q. Can I use RAG with ChatGPT?
Yes. ChatGPT's "Projects" feature and "Custom GPTs" run RAG internally when you upload files (OpenAI calls this the "File Search" feature). If you're a developer who wants to use RAG via API, you can either use the "File Search" tool in OpenAI's Assistants API, or build it yourself with LangChain or similar. Claude offers the same thing through its "Projects" feature.
Q. How much does it cost to run RAG?
Varies enormously with scale. For personal or small (under 10,000 docs, ~1,000 queries/mo), Chroma + the OpenAI API runs around a few dozen dollars a month. Mid-size (100K docs, 100K queries/mo) using Pinecone + GPT-4o lands around several hundred to several thousand dollars/mo. Large enterprise deployments can hit tens of thousands a month. The three main cost drivers: "embedding API," "vector DB," and "LLM API."
Q. What's the difference between RAG and just uploading files to ChatGPT?
Fundamentally the same "retrieval-augmented generation" technique. ChatGPT's file upload feature is RAG running under the covers. The differences: (1) ChatGPT handles a handful to a few dozen files (Projects increases this), but a custom RAG can handle millions; (2) ChatGPT is a black box, while a custom RAG lets you control the search algorithm in detail; (3) ChatGPT runs on OpenAI's servers, while custom RAG can run on-prem. Most serious enterprise deployments build their own RAG.
Q. Does RAG eliminate hallucinations entirely?
No, not entirely. Even with RAG you'll see wrong answers when (1) the relevant document didn't get retrieved, (2) retrieval succeeded but the LLM misinterpreted it, or (3) retrieved chunks contradict each other. Mitigations include prompt constraints like "if it's not in the reference, say 'I don't have that information,'" explicit citations, and ongoing evaluation with RAGAS or similar. Even with all of that you won't hit 100% accuracy, so for high-stakes use cases like medicine and law you should always have a human in the loop.
Q. How do I handle non-English documents?
Three main angles: (1) use a multilingual embedding model (OpenAI text-embedding-3, Cohere Multilingual, BGE-M3, etc.), (2) chunk in a way that respects the morphology and punctuation of your target language, (3) pick an LLM that handles your target language well (GPT-4o, Claude, Gemini, or country-specific models like ELYZA for Japanese). OpenAI's text-embedding-3 handles most major languages well; for highest accuracy in Japanese specifically, BGE-M3 or Cohere is even better.
Q. What's the difference between RAG and an AI agent?
RAG is a fixed "search and answer" pattern. An agent is a dynamic system that "picks the right tools and runs them autonomously, based on a goal." RAG often shows up as one of the tools an agent can call. An agent might switch between "internal search (RAG)," "web search," "calculator," "send email," and so on, with RAG as a building block. There's also "Agentic RAG" --- RAGs in which the LLM itself decides the search strategy.
Q. Is RAG safe? I don't want my confidential data going to an AI
You have several options: (1) put both the vector DB and the embedding work on-prem or inside a VPC (self-host Qdrant, pgvector, etc.), (2) use a locally runnable OSS model for the LLM as well (Llama 3, Qwen, etc.), (3) if you do use an API, sign a contract with OpenAI or Azure OpenAI that data won't be used for training, (4) tag chunks with access-control metadata and filter at query time based on confidentiality level. Fully on-prem RAG is technically achievable, and it's already in use at financial institutions and healthcare providers.
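Point (4) above, access-control filtering, is simple to sketch. The version below post-filters retrieved results by a confidentiality level stored in each chunk's metadata; the field names are made up for illustration. In production you'd normally push this filter into the vector DB query itself (Qdrant, Pinecone, and pgvector all support metadata filters) so restricted chunks never leave the database at all.

```python
def filter_by_access(results: list[dict], user_clearance: int) -> list[dict]:
    """Drop retrieved chunks whose confidentiality level exceeds the user's
    clearance. Chunks without the (hypothetical) 'confidentiality' metadata
    field are treated as public (level 0)."""
    return [r for r in results if r.get("confidentiality", 0) <= user_clearance]
```

Filtering at query time (pre-filter) rather than after retrieval also keeps the top-K slots from being wasted on chunks the user was never allowed to see.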
Q. How long does building a RAG take, and what skills do I need?
A prototype is doable in a few hours to a day for a beginner-level Python developer (Chroma + OpenAI API, around 30 lines). A production-grade build with chunking, hybrid search, reranking, and an evaluation pipeline typically takes 1-3 months. The required skills are "Python basics," "knowing how to use an LLM API," and "basic DB operations." You don't need deep machine-learning knowledge --- this is much more accessible to software engineers than to ML researchers.
This article reflects information as of April 2026. RAG-related tools and models change quickly, so check the latest official documentation when you actually build something.