RAG
About
RAG (Retrieval-Augmented Generation) combines retrieval (fetching relevant documents) with generation (producing text) to enhance AI responses with external knowledge.
The idea is that when you input a query, the model first retrieves relevant documents or passages from a large corpus, then uses that information to generate a more informed and accurate response.
The retrieval part typically uses dense vector search: documents are encoded into vectors, the query is encoded the same way, and the documents whose vectors are closest to the query's are retrieved.
Then the generator part takes the retrieved documents and the original query to generate the answer.
RAG models can dynamically pull in external data, which makes them more powerful for tasks requiring up-to-date or specific information.
Applications of RAG would include question answering, chatbots, and any task where accessing external knowledge is beneficial.
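As a concrete illustration, here is a minimal sketch of that retrieve-then-generate flow, assuming the sentence-transformers library and a toy two-document corpus (the model name and prompt format are illustrative, not prescribed):
from sentence_transformers import SentenceTransformer, util

corpus = [
    "RAG combines a retriever with a text generator.",
    "BM25 is a sparse retrieval method.",
]
query = "What does RAG combine?"

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(corpus)

# Retrieve the passage whose vector is closest to the query vector
best = util.cos_sim(model.encode(query), doc_vecs).argmax().item()

# The generator then conditions on the query plus the retrieved passage
prompt = f"question: {query} context: {corpus[best]}"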
Involved knowledge
Dense retrieval methods.
Approximate nearest neighbor algorithms.
Sequence-to-sequence (seq2seq) models.
Core Components
Retriever
Encodes queries/documents into vectors (e.g., with DPR or other dense embedding models) or scores them with sparse methods such as BM25.
It uses vector search libraries or databases (FAISS, Annoy) for efficient similarity search.
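A minimal indexing-and-search sketch, assuming faiss-cpu and sentence-transformers are installed (the document texts and model name are placeholders):
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["first passage ...", "second passage ..."]
vectors = model.encode(docs, normalize_embeddings=True)

# Inner product on normalized vectors equals cosine similarity
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

query = model.encode(["example query"], normalize_embeddings=True)
scores, ids = index.search(query, 2)  # top-2 most similar passages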
Advanced topics
Multi-hop retrieval: Iterative fetching for complex queries.
Hybrid retrieval: Combine dense vectors (DPR) with sparse methods (BM25).
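A sketch of hybrid scoring, assuming the rank_bm25 package for the sparse side; the equal weights are arbitrary and would need tuning per corpus:
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = ["dense retrieval uses embeddings", "BM25 ranks by term overlap"]
query = "how does BM25 rank documents?"

# Sparse scores from BM25 over whitespace-tokenized text
bm25 = BM25Okapi([d.split() for d in docs])
sparse = bm25.get_scores(query.split())

# Dense scores from embedding cosine similarity
model = SentenceTransformer("all-MiniLM-L6-v2")
dense = util.cos_sim(model.encode(query), model.encode(docs))[0]

# Weighted sum of the two signals (scores should be normalized first)
hybrid = [0.5 * float(s) + 0.5 * float(d) for s, d in zip(sparse, dense)]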
Generator
Is a sequence-to-sequence model (e.g., BART, T5) that generates answers using retrieved documents and the query.
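A sketch of this step, assuming the transformers library; flan-t5-small stands in for whichever BART/T5 variant you would actually fine-tune:
from transformers import pipeline

# A small seq2seq model as a stand-in for any BART/T5 generator
generator = pipeline("text2text-generation", model="google/flan-t5-small")

query = "What does RAG combine?"
retrieved = "RAG combines a retriever with a seq2seq generator."

# Condition generation on both the query and the retrieved context
answer = generator(f"question: {query} context: {retrieved}")
print(answer[0]["generated_text"])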
Chunking Strategies
Key considerations for RAG chunking:
Context Window Limits: Ensure chunks fit within your models' max token limits (e.g., 1024 tokens for BART); see the token-count check after this list.
Retrieval vs. Generation Needs: Smaller chunks improve retrieval accuracy but may lack context for generation.
Overlap: Add 10-20% overlap between chunks to preserve context (e.g., 64 tokens overlap for 512-token chunks).
Domain-Specificity: Legal documents need different chunking than conversational text.
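To check the first consideration, one option is to count tokens with the generator's own tokenizer; a sketch assuming the transformers library and a `chunks` list produced by any strategy below:
from transformers import AutoTokenizer

# Count tokens with the generator's own tokenizer so limits match
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
oversized = [c for c in chunks if len(tokenizer.encode(c)) > tokenizer.model_max_length]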
Fixed-size (naive)
Split text into chunks of fixed token/character length (e.g., 256 tokens).
Predictable for indexing.
Risks splitting sentences/paragraphs mid-context, harming semantic coherence.
May miss relationships between adjacent chunks.
# Example with LangChain (chunk_size counts characters by default)
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = text_splitter.split_text(document)
Sentence/paragraph-based
Split at natural boundaries (sentences, paragraphs, or sections).
Use NLP tools like spaCy or NLTK (e.g., sent_tokenize) to detect sentence/paragraph boundaries.
Preserves semantic units.
Better for retrieval quality than fixed-size.
Variable chunk sizes (may conflict with model token limits).
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(document)
chunks = [sent.text for sent in doc.sents] # Split by sentences
Content-aware (markup/structure)
Chunk based on document structure (e.g., Markdown headers, HTML tags, LaTeX sections).
Leverages existing document hierarchy for meaningful chunks.
Great for technical docs, manuals, or articles.
Requires parsing markup/structure (not all documents are structured).
# Split Markdown by level-2 headers (the lookahead keeps the header text)
import re

chunks = re.split(r'\n(?=## )', document)
Recursive chunking
Use a hierarchy of splitters that falls back to finer separators (paragraphs, then sentences, then words) until chunks fit the size limit.
Balances semantic integrity and size constraints.
Handles variable document structures.
Requires tuning (chunk size, overlap, hierarchy).
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Falls back from paragraphs to sentences to words until chunks fit
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_text(document)
Topic modeling/clustering
Group text by topics (e.g., using NLP models like BERTopic, LDA, or clustering algorithms).
Chunks are thematically coherent.
Reduces noise in retrieval.
Computationally intensive.
Requires labeled data or tuning for unsupervised methods.
from bertopic import BERTopic

# BERTopic needs multiple documents to cluster, so fit on paragraphs
paragraphs = [p for p in document.split("\n\n") if p.strip()]
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(paragraphs)
# Group paragraphs that share a topic ID into one chunk
Dynamic context-aware
Use embeddings or LLMs to dynamically decide chunk boundaries.
Optimizes for semantic continuity.
Adaptive to document content.
High computational cost (embedding every sentence).
Complexity in similarity threshold tuning.
import spacy
from sentence_transformers import SentenceTransformer

nlp = spacy.load("en_core_web_sm")
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [sent.text for sent in nlp(document).sents]
embeddings = model.encode(sentences)
# Merge adjacent sentences whose cosine similarity exceeds 0.8
Hybrid (multi-granularity)
Combine multiple strategies (e.g., split by section headers first, then use recursive chunking within sections).
Maximizes context preservation at multiple levels.
Ideal for complex documents (e.g., research papers, legal contracts).
Requires domain-specific rules.
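One possible sketch, assuming Markdown input and LangChain: split on level-2 headers first, then recursively split any oversized section (the sizes are illustrative):
import re
from langchain.text_splitter import RecursiveCharacterTextSplitter

# First pass: section boundaries; second pass: size-bounded chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
sections = re.split(r'\n(?=## )', document)
chunks = [chunk for section in sections for chunk in splitter.split_text(section)]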
Adaptive with feedback loops
Use retrieval/generation performance to iteratively refine chunking (e.g., reinforce chunks that improve RAG answers).
Tailored to your specific RAG pipeline.
Maximizes end-to-end performance.
Requires labeled evaluation data and ML expertise.
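An illustrative sketch: try a few chunk sizes and keep the one whose top-1 retrieved chunk most often contains the gold answer. `qa_pairs` is a hypothetical list of (question, answer) evaluation examples.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def hit_rate(chunks, qa_pairs):
    # Fraction of questions whose top-1 retrieved chunk contains the answer
    vectors = model.encode(chunks)
    hits = 0
    for question, answer in qa_pairs:
        best = util.cos_sim(model.encode(question), vectors).argmax().item()
        hits += answer.lower() in chunks[best].lower()
    return hits / len(qa_pairs)

# Keep the chunk size that scores best on the evaluation set
best_size = max(
    (256, 512, 1024),
    key=lambda size: hit_rate(
        RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 8)
        .split_text(document),
        qa_pairs,
    ),
)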
Build from Scratch (Thoughts)
Here are some thoughts to consider when building a RAG system from scratch.
Set up a document store.
Preprocess and chunk documents (e.g., into 100-word passages).
Encode and index them using FAISS or Elasticsearch.
Set up the retriever first.
Choose a model (e.g., DPR or sentence-transformers), or build one considering:
Maybe use a bi-encoder architecture where both the query and documents are encoded into vectors (see the sketch after this list).
Then use a nearest neighbor search to find the top-k documents.
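A sketch of the bi-encoder idea using the published DPR checkpoints from transformers (toy passage, no ANN index):
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
c_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

passages = ["RAG retrieves passages before generating an answer."]
q_vec = q_enc(**q_tok("what does RAG do?", return_tensors="pt")).pooler_output
c_vecs = c_enc(**c_tok(passages, return_tensors="pt", padding=True)).pooler_output

# Dot-product relevance; at corpus scale this becomes an ANN search
scores = torch.matmul(q_vec, c_vecs.T)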
For the generator.
Choose a model (e.g., fine-tune BART or T5), or build one considering:
A seq2seq model that takes the query and retrieved documents to generate the answer.
Optimize:
Fine-tune retriever and generator jointly for better relevance.
Experiment with chunk sizes and retrieval thresholds.
Challenges to face
Ensuring the retriever fetches relevant documents.
Handling large document corpora efficiently.
Integrating the retrieval and generation steps smoothly.
Latency can be an issue if the retrieval step is too slow, so optimize the retrieval process with approximate nearest neighbor libraries like FAISS or Annoy.
Evaluation metrics for RAG systems would include:
Retrieval metrics: recall@k, mean reciprocal rank (MRR).
Generation metrics: BLEU, ROUGE, or human evaluations for accuracy/fluency.
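Plain-Python sketches of the two retrieval metrics, assuming ranked lists of document IDs per query:
def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of relevant documents that appear in the top-k results
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mean_reciprocal_rank(results):
    # `results` is a list of (ranked_ids, relevant_ids) pairs, one per query
    total = 0.0
    for ranked_ids, relevant_ids in results:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(results)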
Potential pitfalls
If the retriever is not well-tuned, it might fetch irrelevant documents, leading the generator to produce incorrect answers.
The generator might also ignore the retrieved documents and rely solely on its parametric knowledge, defeating the purpose of RAG.
Mitigate the pitfalls
With joint training of retriever and generator. Techniques like backpropagating gradients through the retrieval step or reinforcement learning might be used.
Requirements
Data preprocessing is essential: documents need to be split into passages or chunks, then encoded and indexed.
Chunk size matters: too small and chunks may lack context; too large and the retriever may not pinpoint the most relevant parts.