Retrieval-Augmented Generation (RAG)

Retrieval-augmented generation (RAG) is an AI architecture that improves the accuracy of language model responses by retrieving relevant information from external knowledge sources before generating an answer. Instead of relying solely on what the model memorized during training, RAG fetches up-to-date, domain-specific data and includes it in the model's context.

How RAG works

RAG operates in two phases. First, the retrieval step: when a user asks a question, the system searches a knowledge base (documents, code, databases) for relevant information, typically using semantic search over vector embeddings. Second, the generation step: the retrieved information is added to the model's prompt as context, and the LLM generates a response grounded in that data. As a result, the model can reference documentation it was never trained on and give answers that are current and specific to your domain.
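The two phases can be illustrated with a small, self-contained TypeScript sketch. It is a toy under stated assumptions: `embed` is a hypothetical bag-of-words stand-in for a real embedding model (production systems use dense vectors from a trained model, usually stored in a vector database), documents are ranked by cosine similarity, and `buildPrompt` only assembles the text that would be sent to an LLM. No model is actually called.

```typescript
type Doc = { id: string; text: string };

// Hypothetical stand-in for an embedding model: a sparse vector of word counts.
function embed(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  const words = text.toLowerCase().match(/[a-z0-9_]+/g) ?? [];
  words.forEach((w) => counts.set(w, (counts.get(w) ?? 0) + 1));
  return counts;
}

// Cosine similarity between two sparse vectors (0 if either is empty).
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  a.forEach((count, word) => {
    dot += count * (b.get(word) ?? 0);
  });
  const norm = (v: Map<string, number>): number =>
    Math.sqrt(Array.from(v.values()).reduce((sum, x) => sum + x * x, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Phase 1, retrieval: rank documents by similarity to the question.
function retrieve(question: string, docs: Doc[], topK: number): Doc[] {
  const q = embed(question);
  return docs
    .map((doc) => ({ doc, score: cosine(q, embed(doc.text)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((r) => r.doc);
}

// Phase 2, generation: put the retrieved text into the prompt as context.
// (This sketch stops at building the prompt; an LLM call would follow.)
function buildPrompt(question: string, context: Doc[]): string {
  const sources = context.map((d) => `[${d.id}] ${d.text}`).join("\n");
  return `Answer using only this context:\n${sources}\n\nQuestion: ${question}`;
}

const docs: Doc[] = [
  { id: "auth.md", text: "Login tokens expire after 15 minutes and are refreshed by the auth service." },
  { id: "deploy.md", text: "Deploys run through the CI pipeline on every merge to main." },
];
const question = "When do login tokens expire?";
console.log(buildPrompt(question, retrieve(question, docs, 1)));
```

The printed prompt contains only `auth.md`, the document that shares words with the question. A real embedding model would also match paraphrases (for example, "session lifetime" for "token expiry"), which raw word counts cannot.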

RAG in AI coding tools

AI coding tools use RAG-like patterns extensively. When Cursor or Cline indexes your codebase, it creates embeddings of your files and retrieves relevant code when you ask a question. This is why these tools can answer questions about your specific project instead of giving only generic coding advice. Claude Code takes a different approach: rather than pre-indexing, it reads files directly from your filesystem on demand. The principle is the same, though: ground AI responses in your actual code.

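```typescript
// Simplified RAG pipeline for a coding assistant.
// Hypothetical placeholders for your chunker, embedding model, vector store, and LLM client:
declare function splitCodeIntoChunks(codebase: string): string[];
declare const model: { embed(texts: string[]): Promise<number[][]> };
declare const vectorDB: {
  upsert(items: { id: string; vector: number[]; text: string }[]): Promise<void>;
  search(vector: number[], opts: { topK: number }): Promise<{ text: string }[]>;
};
declare const llm: { generate(req: { prompt: string }): Promise<string> };
declare const codebase: string, userQuestion: string;

// 1. Index: embed each chunk and store the vector together with its source text
const chunks = splitCodeIntoChunks(codebase);
const vectors = await model.embed(chunks);
await vectorDB.upsert(chunks.map((text, i) => ({ id: `chunk-${i}`, vector: vectors[i], text })));

// 2. Retrieve: embed the question and fetch the nearest chunks
const [queryVector] = await model.embed([userQuestion]);
const matches = await vectorDB.search(queryVector, { topK: 10 });
const relevantCode = matches.map((m) => m.text).join("\n\n");

// 3. Generate: answer with the retrieved code in the prompt
const response = await llm.generate({
  prompt: `Given this code context:\n${relevantCode}\n\nQuestion: ${userQuestion}`,
});
```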

RAG quality depends on retrieval quality. If the system retrieves irrelevant code, the LLM generates irrelevant answers. Good chunking strategies and embedding models matter as much as the generation model.
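Chunking is one of the knobs that affects retrieval quality. As an illustration, here is a minimal line-window chunker, roughly the role `splitCodeIntoChunks` plays in a pipeline like the one above. `chunkByLines`, its 40-line window, and its 10-line overlap are illustrative assumptions, not what any particular tool uses; a common refinement is to split along syntax boundaries such as functions and classes.

```typescript
// Hypothetical line-window chunker: fixed-size windows of lines with overlap,
// so code near a chunk boundary also appears in the neighboring chunk.
type Chunk = { file: string; startLine: number; endLine: number; text: string };

function chunkByLines(
  file: string,
  source: string,
  maxLines = 40,
  overlap = 10,
): Chunk[] {
  if (overlap >= maxLines) throw new Error("overlap must be smaller than maxLines");
  const lines = source.split("\n");
  const chunks: Chunk[] = [];
  const step = maxLines - overlap;
  for (let start = 0; start < lines.length; start += step) {
    const end = Math.min(start + maxLines, lines.length);
    chunks.push({
      file,
      startLine: start + 1, // 1-based, inclusive
      endLine: end, // 1-based, inclusive
      text: lines.slice(start, end).join("\n"),
    });
    if (end === lines.length) break; // this window reached the end of the file
  }
  return chunks;
}

const source = Array.from({ length: 100 }, (_, i) => `line ${i + 1}`).join("\n");
for (const c of chunkByLines("example.ts", source)) {
  console.log(`${c.file}:${c.startLine}-${c.endLine}`);
}
```

With 100 input lines this yields three chunks covering lines 1-40, 31-70, and 61-100. Recording the file and line range with each chunk lets an answer point back to the exact source location.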

What is the difference between RAG and fine-tuning?

Fine-tuning changes the model's weights by training it on new data. RAG leaves the model unchanged and instead provides relevant information at query time. RAG is better for frequently changing data (like a codebase), while fine-tuning is better for teaching the model new capabilities or persistent knowledge.

Do all AI coding tools use RAG?

Most use RAG-like patterns but implement them differently. Cursor and Cline pre-index your codebase with embeddings. Claude Code reads files on demand from the filesystem. Both approaches ground the AI in your actual code rather than relying on training data alone.
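For contrast with the indexed pipeline, on-demand retrieval can be sketched as searching files at question time and reading only the ones that match. This is a hypothetical illustration of the general idea, not how Claude Code is implemented: `grep`, `gatherContext`, and the in-memory `files` map are assumptions standing in for real filesystem search and file reads.

```typescript
// On-demand retrieval with no pre-built index: scan files at query time,
// then read only the files that contain a match.
// The in-memory `files` map stands in for a real filesystem.
type Hit = { path: string; line: number; text: string };

// `pattern` should not use the g flag: RegExp.test with g keeps lastIndex state.
function grep(files: Map<string, string>, pattern: RegExp): Hit[] {
  const hits: Hit[] = [];
  files.forEach((content, path) => {
    content.split("\n").forEach((text, i) => {
      if (pattern.test(text)) hits.push({ path, line: i + 1, text });
    });
  });
  return hits;
}

// Read every file with at least one hit and label it for the prompt.
function gatherContext(files: Map<string, string>, pattern: RegExp): string {
  const paths = Array.from(new Set(grep(files, pattern).map((h) => h.path)));
  return paths.map((p) => `// File: ${p}\n${files.get(p)}`).join("\n\n");
}

const files = new Map<string, string>([
  ["src/auth.ts", "export function refreshToken() {}\nexport const TOKEN_TTL = 900;"],
  ["src/deploy.ts", "export function deploy() {}"],
]);
console.log(grep(files, /TOKEN_TTL/));
console.log(gatherContext(files, /token/i));
```

The trade-off: there is no index to build or keep in sync as files change, but exact-pattern search only finds code that uses the searched-for names, whereas embedding search can also surface semantically related code.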

What are the limitations of RAG?

RAG is limited by retrieval quality: if the right information is not retrieved, the model cannot use it. It also adds latency, because a search step runs before generation, and it is constrained by the size of the model's context window. Complex queries that need information scattered across many files can exceed what retrieval can effectively gather.
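The context-window limit can be made concrete: retrieved chunks must fit in a token budget, so lower-ranked chunks may be dropped even when they are relevant. A minimal greedy packing sketch follows; `packContext` is hypothetical, and token counts use a rough four-characters-per-token heuristic rather than a real tokenizer.

```typescript
// Greedy context packing: keep the highest-scoring retrieved chunks until a
// token budget is used up. Token counts are a rough ~4 characters per token
// estimate (an assumption); a real system would use the model's tokenizer.
type Scored = { id: string; text: string; score: number };

const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function packContext(chunks: Scored[], budgetTokens: number): Scored[] {
  const picked: Scored[] = [];
  let used = 0;
  const ranked = chunks.slice().sort((a, b) => b.score - a.score);
  for (const chunk of ranked) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > budgetTokens) continue; // skip chunks that no longer fit
    picked.push(chunk);
    used += cost;
  }
  return picked;
}

const retrieved: Scored[] = [
  { id: "a", text: "x".repeat(400), score: 0.9 }, // ~100 tokens
  { id: "b", text: "y".repeat(800), score: 0.8 }, // ~200 tokens
  { id: "c", text: "z".repeat(200), score: 0.5 }, // ~50 tokens
];
console.log(packContext(retrieved, 160).map((c) => c.id));
```

Using `continue` rather than `break` lets a smaller, lower-ranked chunk fill space that a larger one could not. With a 160-token budget the example keeps `a` and `c` and drops `b`, which is the failure mode described above: information spread across many chunks does not all make it into the prompt.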


