RAG
About
RAG (Retrieval-Augmented Generation) combines retrieval (fetching relevant documents) with generation (producing text) to enhance AI responses with external knowledge.
The idea is that when you input a query, the model first retrieves relevant documents or passages from a large corpus, then uses that information to generate a more informed and accurate response.
The retrieval part typically uses dense vector search: documents are encoded into vectors, the query is encoded the same way, and the documents whose vectors are closest to the query's are retrieved.
Then the generator part takes the retrieved documents and the original query to generate the answer.
RAG models can dynamically pull in external data, which makes them more powerful for tasks requiring up-to-date or specific information.
Applications of RAG would include question answering, chatbots, and any task where accessing external knowledge is beneficial.
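As a concrete illustration, here is a minimal sketch of that retrieve-then-generate flow, assuming the sentence-transformers library and a toy two-document corpus (the model name and prompt format are illustrative, not prescribed):
from sentence_transformers import SentenceTransformer, util

corpus = [
    "RAG combines a retriever with a text generator.",
    "BM25 is a sparse retrieval method.",
]
query = "What does RAG combine?"

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(corpus)

# Retrieve the passage whose vector is closest to the query vector
best = util.cos_sim(model.encode(query), doc_vecs).argmax().item()

# The generator then conditions on the query plus the retrieved passage
prompt = f"question: {query} context: {corpus[best]}"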
Involved knowledge
Dense retrieval methods.
Approximate nearest neighbor algorithms.
Sequence-to-sequence (seq2seq) models.
Core Components
Retriever
Encodes queries/documents into vectors (e.g., with DPR or other dense embedding models) or scores them with sparse methods such as BM25.
It uses vector search libraries or databases (FAISS, Annoy) for efficient similarity search.
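A minimal indexing-and-search sketch, assuming faiss-cpu and sentence-transformers are installed (the document texts and model name are placeholders):
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["first passage ...", "second passage ..."]
vectors = model.encode(docs, normalize_embeddings=True)

# Inner product on normalized vectors equals cosine similarity
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

query = model.encode(["example query"], normalize_embeddings=True)
scores, ids = index.search(query, 2)  # top-2 most similar passages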
Advanced topics
Multi-hop retrieval: Iterative fetching for complex queries.
Hybrid retrieval: Combine dense vectors (DPR) with sparse methods (BM25).
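A sketch of hybrid scoring, assuming the rank_bm25 package for the sparse side; the equal weights are arbitrary and would need tuning per corpus:
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = ["dense retrieval uses embeddings", "BM25 ranks by term overlap"]
query = "how does BM25 rank documents?"

# Sparse scores from BM25 over whitespace-tokenized text
bm25 = BM25Okapi([d.split() for d in docs])
sparse = bm25.get_scores(query.split())

# Dense scores from embedding cosine similarity
model = SentenceTransformer("all-MiniLM-L6-v2")
dense = util.cos_sim(model.encode(query), model.encode(docs))[0]

# Weighted sum of the two signals (scores should be normalized first)
hybrid = [0.5 * float(s) + 0.5 * float(d) for s, d in zip(sparse, dense)]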
Generator
Is a sequence-to-sequence model (e.g., BART, T5) that generates answers using retrieved documents and the query.
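A sketch of this step, assuming the transformers library; flan-t5-small stands in for whichever BART/T5 variant you would actually fine-tune:
from transformers import pipeline

# A small seq2seq model as a stand-in for any BART/T5 generator
generator = pipeline("text2text-generation", model="google/flan-t5-small")

query = "What does RAG combine?"
retrieved = "RAG combines a retriever with a seq2seq generator."

# Condition generation on both the query and the retrieved context
answer = generator(f"question: {query} context: {retrieved}")
print(answer[0]["generated_text"])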
Chunking Strategies
Key considerations for RAG chunking:
Context Window Limits: Ensure chunks fit within your models' max token limits (e.g., 1024 tokens for BART); see the token-count check after this list.
Retrieval vs. Generation Needs: Smaller chunks improve retrieval accuracy but may lack context for generation.
Overlap: Add 10-20% overlap between chunks to preserve context (e.g., 64 tokens overlap for 512-token chunks).
Domain-Specificity: Legal documents need different chunking than conversational text.
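To check the first consideration, one option is to count tokens with the generator's own tokenizer; a sketch assuming the transformers library and a `chunks` list produced by any strategy below:
from transformers import AutoTokenizer

# Count tokens with the generator's own tokenizer so limits match
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
oversized = [c for c in chunks if len(tokenizer.encode(c)) > tokenizer.model_max_length]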
Fixed-size (naive)
Split text into chunks of fixed token/character length (e.g., 256 tokens).
Predictable for indexing.
Risks splitting sentences/paragraphs mid-context, harming semantic coherence.
May miss relationships between adjacent chunks.
# Example with LangChain (chunk_size counts characters by default)
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = text_splitter.split_text(document)
Sentence/paragraph-based
Split at natural boundaries (sentences, paragraphs, or sections).
Use NLP tools like spaCy or NLTK (e.g., sent_tokenize) to detect sentence/paragraph boundaries.
Preserves semantic units.
Better for retrieval quality than fixed-size.
Variable chunk sizes (may conflict with model token limits).
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(document)
chunks = [sent.text for sent in doc.sents] # Split by sentences
Content-aware (markup/structure)
Chunk based on document structure (e.g., Markdown headers, HTML tags, LaTeX sections).
Leverages existing document hierarchy for meaningful chunks.
Great for technical docs, manuals, or articles.
Requires parsing markup/structure (not all documents are structured).
# Split Markdown by level-2 headers (the lookahead keeps the header text)
import re

chunks = re.split(r'\n(?=## )', document)
Recursive chunking
Use a hierarchy of splitters that falls back to finer separators (paragraphs, then sentences, then words) until chunks fit the size limit.
Balances semantic integrity and size constraints.
Handles variable document structures.
Requires tuning (chunk size, overlap, hierarchy).
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Falls back from paragraphs to sentences to words until chunks fit
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_text(document)
Topic modeling/clustering
Group text by topics (e.g., using NLP models like BERTopic, LDA, or clustering algorithms).
Chunks are thematically coherent.
Reduces noise in retrieval.
Computationally intensive.
Requires labeled data or tuning for unsupervised methods.
from bertopic import BERTopic

# BERTopic needs multiple documents to cluster, so fit on paragraphs
paragraphs = [p for p in document.split("\n\n") if p.strip()]
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(paragraphs)
# Group paragraphs that share a topic ID into one chunk
Dynamic context-aware
Use embeddings or LLMs to dynamically decide chunk boundaries.
Optimizes for semantic continuity.
Adaptive to document content.
High computational cost (embedding every sentence).
Complexity in similarity threshold tuning.
import spacy
from sentence_transformers import SentenceTransformer

nlp = spacy.load("en_core_web_sm")
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [sent.text for sent in nlp(document).sents]
embeddings = model.encode(sentences)
# Merge adjacent sentences whose cosine similarity exceeds 0.8
Hybrid (multi-granularity)
Combine multiple strategies (e.g., split by section headers first, then use recursive chunking within sections).
Maximizes context preservation at multiple levels.
Ideal for complex documents (e.g., research papers, legal contracts).
Requires domain-specific rules.
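One possible sketch, assuming Markdown input and LangChain: split on level-2 headers first, then recursively split any oversized section (the sizes are illustrative):
import re
from langchain.text_splitter import RecursiveCharacterTextSplitter

# First pass: section boundaries; second pass: size-bounded chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
sections = re.split(r'\n(?=## )', document)
chunks = [chunk for section in sections for chunk in splitter.split_text(section)]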
Adaptive with feedback loops
Use retrieval/generation performance to iteratively refine chunking (e.g., reinforce chunks that improve RAG answers).
Tailored to your specific RAG pipeline.
Maximizes end-to-end performance.
Requires labeled evaluation data and ML expertise.
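An illustrative sketch: try a few chunk sizes and keep the one whose top-1 retrieved chunk most often contains the gold answer. `qa_pairs` is a hypothetical list of (question, answer) evaluation examples.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def hit_rate(chunks, qa_pairs):
    # Fraction of questions whose top-1 retrieved chunk contains the answer
    vectors = model.encode(chunks)
    hits = 0
    for question, answer in qa_pairs:
        best = util.cos_sim(model.encode(question), vectors).argmax().item()
        hits += answer.lower() in chunks[best].lower()
    return hits / len(qa_pairs)

# Keep the chunk size that scores best on the evaluation set
best_size = max(
    (256, 512, 1024),
    key=lambda size: hit_rate(
        RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 8)
        .split_text(document),
        qa_pairs,
    ),
)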
Build from Scratch (Thoughts)
Here are some thoughts to consider when building a RAG system from scratch.
Set up a document store.
Preprocess and chunk documents (e.g., into 100-word passages).
Encode and index them using FAISS or Elasticsearch.
Set up the retriever first.
Choose a model (e.g., DPR or sentence-transformers), or build one considering:
Maybe use a bi-encoder architecture where both the query and documents are encoded into vectors (see the sketch after this list).
Then use a nearest neighbor search to find the top-k documents.
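A sketch of the bi-encoder idea using the published DPR checkpoints from transformers (toy passage, no ANN index):
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
c_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

passages = ["RAG retrieves passages before generating an answer."]
q_vec = q_enc(**q_tok("what does RAG do?", return_tensors="pt")).pooler_output
c_vecs = c_enc(**c_tok(passages, return_tensors="pt", padding=True)).pooler_output

# Dot-product relevance; at corpus scale this becomes an ANN search
scores = torch.matmul(q_vec, c_vecs.T)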
For the generator.
Choose a model (e.g., fine-tune BART or T5), or build one considering:
A seq2seq model that takes the query and retrieved documents to generate the answer.
Optimize:
Fine-tune retriever and generator jointly for better relevance.
Experiment with chunk sizes and retrieval thresholds.
Challenges to face
Ensuring the retriever fetches relevant documents.
Handling large document corpora efficiently.
Integrating the retrieval and generation steps smoothly.
Latency can be an issue if the retrieval step is too slow, so optimize the retrieval process with approximate nearest neighbor libraries like FAISS or Annoy.
Evaluation metrics for RAG systems would include:
Retrieval metrics: recall@k, mean reciprocal rank (MRR).
Generation metrics: BLEU, ROUGE, or human evaluations for accuracy/fluency.
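Plain-Python sketches of the two retrieval metrics, assuming ranked lists of document IDs per query:
def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of relevant documents that appear in the top-k results
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mean_reciprocal_rank(results):
    # `results` is a list of (ranked_ids, relevant_ids) pairs, one per query
    total = 0.0
    for ranked_ids, relevant_ids in results:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(results)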
Potential pitfalls
If the retriever is not well-tuned, it might fetch irrelevant documents, leading the generator to produce incorrect answers.
The generator might also ignore the retrieved documents and rely solely on its parametric knowledge, defeating the purpose of RAG.
Mitigate the pitfalls
With joint training of retriever and generator. Techniques like backpropagating gradients through the retrieval step or reinforcement learning might be used.
Requirements
Data preprocessing is essential: documents need to be split into passages or chunks, then encoded and indexed.
Chunk size matters: too small and chunks may lack context; too large and the retriever may not pinpoint the most relevant parts.