RAG

About

RAG (Retrieval-Augmented Generation) combines retrieval (fetching relevant documents) with generation (producing text) to enhance AI responses with external knowledge.

The idea is that when you input a query, the model first retrieves relevant documents or passages from a large corpus, and then uses that information to generate a more informed and accurate response.

RAG's purpose is to overcome the static knowledge limits of traditional models like GPT by dynamically accessing external data.

The retrieval step typically uses dense vector search: documents are encoded into vectors, the query is encoded the same way, and the closest document vectors are retrieved by similarity.

Then the generator part takes the retrieved documents and the original query to generate the answer.

Applications of RAG include question answering, chatbots, and any task where access to external knowledge is beneficial.

Involved knowledge

  • Dense retrieval methods.

  • Approximate nearest neighbor algorithms.

  • Sequence-to-sequence (seq2seq) models.

Core Components

Retriever

Encodes queries and documents for matching (e.g., dense embeddings such as DPR, or sparse scoring such as BM25).

It uses vector databases (FAISS, Annoy) for efficient similarity search.
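
A minimal sketch of the dense-retrieval flow, assuming sentence-transformers for the embeddings; the model name and the two toy documents are illustrative:

# Dense retrieval with FAISS
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Paris is the capital of France.", "The Nile flows through Egypt."]
doc_vecs = np.asarray(model.encode(docs, normalize_embeddings=True), dtype="float32")

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product == cosine similarity on normalized vectors
index.add(doc_vecs)

query_vec = np.asarray(model.encode(["capital of France?"], normalize_embeddings=True), dtype="float32")
scores, ids = index.search(query_vec, 1)  # retrieve the top-1 closest document

IndexFlatIP performs exact search; for large corpora an approximate index (e.g., faiss.IndexHNSWFlat) trades a little accuracy for speed.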

Advanced topics

  • Multi-hop retrieval: Iterative fetching for complex queries.

  • Hybrid retrieval: Combine dense vectors (DPR) with sparse methods (BM25); a small scoring sketch follows.
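
A small sketch of hybrid scoring, assuming the rank_bm25 package for the sparse side; score normalization is skipped and the 0.5 blend weight is an arbitrary starting point:

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = ["Paris is the capital of France.", "The Nile flows through Egypt."]
query = "capital of France"

# Sparse scores from BM25 over whitespace-tokenized text
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse = bm25.get_scores(query.lower().split())

# Dense scores from embedding cosine similarity
model = SentenceTransformer("all-MiniLM-L6-v2")
dense = util.cos_sim(model.encode(query), model.encode(docs))[0]

# Blend both signals (real systems normalize the scores first)
hybrid = [0.5 * s + 0.5 * float(d) for s, d in zip(sparse, dense)]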

Generator

A sequence-to-sequence model (e.g., BART or T5) that generates the answer conditioned on the retrieved documents and the query.
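
A minimal sketch of the generation step, assuming Hugging Face transformers; the checkpoint and prompt format are illustrative:

from transformers import pipeline

# Any seq2seq checkpoint can stand in here; a real RAG system would typically fine-tune it
generator = pipeline("text2text-generation", model="google/flan-t5-small")

query = "What is the capital of France?"
retrieved_docs = ["Paris is the capital of France."]  # output of the retrieval step
prompt = f"question: {query} context: {' '.join(retrieved_docs)}"
answer = generator(prompt, max_length=64)[0]["generated_text"]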

Chunking Strategies

When to use which strategy:

  • Start Simple: Fixed-size or sentence-based chunking for prototyping.

  • Scale Up: Hybrid or dynamic chunking for production systems.

  • Specialized Documents: Use markup/structure-aware chunking for technical docs.

  • Optimized Pipelines: Adaptive chunking if you have resources for feedback loops.

Fixed-size (naive)

Split text into chunks of fixed token/character length (e.g., 256 tokens).

Example:

Split every 100 words or 512 characters.

# Example with LangChain
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = text_splitter.split_text(document)

Sentence/paragraph-based

Split at natural boundaries (sentences, paragraphs, or sections)

Use NLP tools like spaCy or NLTK's sent_tokenize to detect sentence/paragraph boundaries.

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(document)
chunks = [sent.text for sent in doc.sents]  # Split by sentences

Content-aware (markup/structure)

Chunk based on document structure (e.g., Markdown headers, HTML tags, LaTeX sections).

Example:

Split at ## Heading in Markdown or <h2> in HTML.

# Split Markdown by headers
import re
chunks = re.split(r'\n## ', document)

Recursive chunking

Use a hierarchy of splitters.

Example:

Split by paragraphs first, then split large paragraphs into sentences.

from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_text(document)

Topic modeling/clustering

Group text by topics (e.g., using NLP models like BERTopic, LDA, or clustering algorithms).

Example:

Use embeddings to cluster semantically similar sentences.

from bertopic import BERTopic
# BERTopic needs many short texts, so fit on sentences or passages rather than one document
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(sentences)  # sentences: list of sentence strings
# Group sentences that share a topic id into chunks

Dynamic context-aware

Use embeddings or LLMs to dynamically decide chunk boundaries.

Example:

  1. Embed sentences with sentence-transformers.

  2. Merge adjacent sentences until embedding similarity drops below a threshold.

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [sent.text for sent in doc.sents]  # doc from the spaCy example above
embeddings = model.encode(sentences)
chunks, current = [], [sentences[0]]  # merge adjacent sentences while similarity stays above 0.8
for i in range(1, len(sentences)):
    if util.cos_sim(embeddings[i - 1], embeddings[i]).item() > 0.8:
        current.append(sentences[i])
    else:
        chunks.append(" ".join(current)); current = [sentences[i]]
chunks.append(" ".join(current))

Hybrid (multi-granularity)

Combine multiple strategies (e.g., split by section headers first, then use recursive chunking within sections).

Example:

  1. Split a medical paper into "Abstract," "Methods," "Results."

  2. Chunk "Methods" into smaller procedural steps.

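A minimal sketch, assuming Markdown-style "## " section headers mark the coarse sections:

import re
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Coarse pass: split by section headers; fine pass: recursive chunking within each section
sections = re.split(r'\n## ', document)
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = [chunk for section in sections for chunk in splitter.split_text(section)]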

Adaptive with feedback loops

Use retrieval/generation performance to iteratively refine chunking (e.g., reinforce chunks that improve RAG answers).

Example:

  1. Train a classifier to predict which chunks lead to high-quality answers.

  2. Optimize chunk size/boundaries based on feedback.

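A minimal sketch, assuming you have logged chunk embeddings with labels marking whether each chunk led to a good answer; all data names here are hypothetical:

from sklearn.linear_model import LogisticRegression

# chunk_embeddings / answer_was_good are hypothetical logged feedback data
clf = LogisticRegression().fit(chunk_embeddings, answer_was_good)
scores = clf.predict_proba(candidate_chunk_embeddings)[:, 1]
# Re-chunk or re-index low-scoring chunks, e.g., retry with a smaller chunk_size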

Build from Scratch (Thoughts)

Hugging Face RAG tutorial.

Here are some thoughts to consider when building a RAG system from scratch.

  1. Set up a document store.

    1. Preprocess and chunk documents (e.g., into 100-word passages).

    2. Encode and index using FAISS or Elasticsearch.

  2. Set up the retriever first.

    1. Choose a model (e.g., DPR or sentence-transformers).

    2. Or build one considering:

      1. Maybe use a bi-encoder architecture where both the query and documents are encoded into vectors.

      2. Then use a nearest neighbor search to find the top-k documents.

  3. For the generator.

    1. Choose a model (e.g., fine-tune BART or T5).

    2. Or build one considering:

      1. A seq2seq model that takes the query and retrieved documents to generate the answer.

  4. Optimize:

    1. Fine-tune retriever and generator jointly for better relevance.

    2. Experiment with chunk sizes and retrieval thresholds (an end-to-end sketch follows this list).
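
A minimal end-to-end sketch tying the steps together; the corpus, model names, and prompt format are all illustrative:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

corpus = ["Paris is the capital of France.", "The Nile flows through Egypt."]  # 1. chunked document store
encoder = SentenceTransformer("all-MiniLM-L6-v2")                              # 2. bi-encoder retriever
index = faiss.IndexFlatIP(encoder.get_sentence_embedding_dimension())
index.add(np.asarray(encoder.encode(corpus, normalize_embeddings=True), dtype="float32"))

query = "What is the capital of France?"
q_vec = np.asarray(encoder.encode([query], normalize_embeddings=True), dtype="float32")
_, ids = index.search(q_vec, 1)                                                # top-k nearest neighbors
context = " ".join(corpus[i] for i in ids[0])

generator = pipeline("text2text-generation", model="google/flan-t5-small")    # 3. seq2seq generator
answer = generator(f"question: {query} context: {context}", max_length=64)[0]["generated_text"]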

Challenges to face

  • Ensuring the retriever fetches relevant documents.

  • Handling large document corpora efficiently.

  • Integrating the retrieval and generation steps smoothly.

  • Latency can be an issue if the retrieval step is too slow, so optimize retrieval with approximate nearest neighbor libraries such as FAISS or Annoy.

Evaluation metrics for RAG systems include:

  • Retrieval metrics:

    • Recall@k and mean reciprocal rank (MRR); see the sketch after this list.

  • Generation metrics:

    • BLEU, ROUGE, or human evaluation for accuracy and fluency.
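
A small sketch of the retrieval metrics, assuming per-query lists of retrieved ids (best first) and one relevant id per query:

def recall_at_k(retrieved, relevant, k):
    # Fraction of queries whose relevant doc appears in the top-k results
    return sum(rel in r[:k] for r, rel in zip(retrieved, relevant)) / len(relevant)

def mrr(retrieved, relevant):
    # Mean of 1/rank of the relevant doc; queries where it is missing count as 0
    ranks = (r.index(rel) + 1 for r, rel in zip(retrieved, relevant) if rel in r)
    return sum(1.0 / rank for rank in ranks) / len(relevant)

retrieved = [[3, 1, 7], [2, 5, 9]]  # doc ids per query, best first
relevant = [1, 9]                   # the relevant doc id per query
print(recall_at_k(retrieved, relevant, k=2), mrr(retrieved, relevant))  # 0.5 0.416...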

Potential pitfalls

If the retriever is not well-tuned, it might fetch irrelevant documents, leading the generator to produce incorrect answers.

Also, the generator might ignore the retrieved documents and rely solely on its parametric knowledge, defeating the purpose of RAG.

Mitigate the pitfalls

Jointly train the retriever and generator. Techniques like backpropagating gradients through the retrieval step or reinforcement learning can be used.

Requirements

Data preprocessing is essential. Documents need to be split into passages or chunks, encoded and indexed.

Chunk size matters: too small and chunks may lack context; too large and the retriever might not isolate the most relevant parts.
