Comparative Evaluation of Retrieval Strategies for RAG

A reproducible machine learning benchmark that compares four retrieval strategies for Retrieval-Augmented Generation (RAG) on the BEIR FiQA-2018 dataset. The project measures retrieval and answer quality with RAGAS metrics and tracks retrieval and end-to-end latency for each pipeline.

Project Overview

This project evaluates how different retrieval methods affect RAG answer quality on financial question-answering data. Four pipelines are compared:

Method	Description
BM25	Lexical sparse retrieval
Dense	Embedding-based semantic retrieval (FAISS)
Hybrid	BM25 + Dense fused with Reciprocal Rank Fusion
Hybrid + Reranker	Hybrid retrieval with cross-encoder reranking

Each pipeline retrieves documents, generates answers with Gemini 2.5 Flash, and is scored with RAGAS metrics. Results are exported as CSV leaderboards and matplotlib visualizations.

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                        experiments/run_all.py                       │
│                     (Unified Experiment Runner)                     │
└─────────────────────────────────────────────────────────────────────┘
                                    │
          ┌─────────────────────────┼─────────────────────────┐
          ▼                         ▼                         ▼
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│   ingestion/     │    │   retrieval/     │    │   generation/    │
│  load_beir.py    │    │   bm25.py        │    │ gemini_generator │
│  embeddings.py   │    │   dense.py       │    │       .py        │
└──────────────────┘    │   hybrid.py      │    └──────────────────┘
          │             │   reranker.py    │              │
          │             └──────────────────┘              │
          │                         │                       │
          └─────────────────────────┼───────────────────────┘
                                    ▼
                    ┌───────────────────────────────┐
                    │         evaluation/           │
                    │  ragas_eval.py                │
                    │  latency_eval.py              │
                    └───────────────────────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────────┐
                    │       visualization/          │
                    │      leaderboard.py           │
                    └───────────────────────────────┘
                                    │
                                    ▼
                         results/leaderboard.csv
                         results/leaderboard.png

Retrieval Pipelines

Pipeline 1: BM25

Query → BM25 (rank-bm25) → Top-K Documents

Tokenizes the corpus and queries, then ranks documents by BM25 relevance score.
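
To make the scoring step concrete, here is a minimal sketch of Okapi BM25 in plain Python. It is not the rank-bm25 implementation the pipeline uses; the whitespace tokenizer, the k1 and b defaults, and the function names bm25_scores and top_k are assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, then split on whitespace.
    return text.lower().split()


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document in `corpus`."""
    docs = [tokenize(doc) for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF (the "+ 1" variant, which stays non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(scores, k):
    # Indices of the k highest-scoring documents, best first.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


corpus = [
    "how to invest in index funds",
    "best way to cook pasta",
    "index funds versus mutual funds for retirement",
]
scores = bm25_scores("index funds", corpus)
ranked = top_k(scores, k=2)
```

Documents that share no terms with the query score zero, which is why purely lexical retrieval can miss paraphrased questions.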

Pipeline 2: Dense Retrieval

Query → BGE Embedding → FAISS Similarity Search → Top-K Documents

Encodes queries and documents with BAAI/bge-small-en-v1.5, then searches a FAISS inner-product index over L2-normalized embeddings, so the inner product equals cosine similarity.
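
The reason normalization matters can be shown without FAISS or a real embedding model. The sketch below uses toy 3-dimensional vectors and an exhaustive search; normalize, inner_product, and search are hypothetical helper names, not functions from this project.

```python
import math


def normalize(vec):
    # Scale a vector to unit L2 norm.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, doc_vecs, k):
    """Exhaustive inner-product search; returns (doc_index, score) pairs."""
    q = normalize(query_vec)
    scored = [(i, inner_product(q, normalize(d))) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; real BGE vectors have hundreds of dimensions.
doc_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 2.0]]
hits = search([2.0, 0.0, 0.0], doc_vecs, k=2)
```

Because every vector is scaled to unit length first, vector magnitude does not influence the ranking; only direction does.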

Pipeline 3: Hybrid Retrieval

Query → BM25 Results ──┐
                       ├→ RRF Fusion → Top-K Documents
Query → Dense Results ─┘

Combines BM25 and dense ranked lists using standard Reciprocal Rank Fusion (RRF):

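score(d) = Σ_r 1 / (k + rank_r(d))

The sum runs over the ranked lists r (here BM25 and dense), rank_r(d) is the 1-based position of document d in list r, and k is a smoothing constant. A document absent from a list contributes nothing for that list.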
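
A minimal sketch of the fusion step follows. The function name rrf_fuse is hypothetical, and k=60 is a commonly used value assumed here; the value this project uses is not stated above.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, top_k=5):
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based; a document missing from a list adds nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_k]


bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d2", "d4", "d1"]
fused = rrf_fuse([bm25_ranking, dense_ranking], k=60, top_k=3)
```

In this toy input, d2 ranks first because it sits near the top of both lists, while d3 and d4 each appear in only one. RRF uses ranks rather than raw scores, so BM25 and dense scores never need to be put on a common scale.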

Pipeline 4: Hybrid + Reranker

Query → Hybrid Retrieval → Top-20 Candidates → BGE Reranker → Top-K Documents

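Retrieves rerank_top_n candidates (20 by default) with hybrid search, then rescores each query-document pair with the BAAI/bge-reranker-base cross-encoder and keeps the top K. This is expected to be the strongest pipeline on retrieval quality, at the cost of extra retrieval latency.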
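
The two-stage flow can be sketched as below. To keep the example runnable without model weights, overlap_score is a stand-in for the cross-encoder (it only counts shared tokens); it and the rerank function are hypothetical names, not this project's API.

```python
def overlap_score(query, passage):
    # Stand-in for a cross-encoder: fraction of query tokens found in the passage.
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(query, candidates, score_pair, rerank_top_n=20, top_k=5):
    """Score the first `rerank_top_n` candidates against the query, keep `top_k`."""
    pool = candidates[:rerank_top_n]
    scored = [(passage, score_pair(query, passage)) for passage in pool]
    # list.sort() is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in scored[:top_k]]


candidates = [
    "mutual funds charge management fees",
    "index funds track a market index",
    "pasta cooks in ten minutes",
]
top = rerank("what do index funds track", candidates, overlap_score, top_k=2)
```

Swapping overlap_score for a real cross-encoder call would leave the surrounding logic unchanged, since the reranker only needs a function that maps a (query, passage) pair to a score.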

Dataset Description

BEIR FiQA-2018 is a financial question-answering benchmark from the BEIR collection. It contains:

Financial forum posts and Stack Exchange questions as queries
Expert-annotated relevance judgments (qrels)
A corpus of financial documents

The dataset is automatically downloaded on first run and stored under data/fiqa/. The official BEIR test split is used. Query count is configurable via num_queries in config.yaml.

Installation

Prerequisites

Python 3.10+
Google Gemini API key

Setup

git clone <repository-url>
cd rag-beir-benchmark

python -m venv .venv
source .venv/bin/activate        # Linux/macOS
# .venv\Scripts\activate         # Windows

pip install -r requirements.txt

Set your Gemini API key:

export GOOGLE_API_KEY="your-api-key-here"   # Linux/macOS
# set GOOGLE_API_KEY=your-api-key-here      # Windows

Running Experiments

Run the full benchmark from the project root:

python experiments/run_all.py

This will:

Download FiQA-2018 if missing
Build BM25 and FAISS indexes (cached for subsequent runs)
Run all four retrieval pipelines
Generate answers with Gemini 2.5 Flash
Evaluate with RAGAS metrics
Save CSV results and leaderboard visualization

Configuration

All parameters are in config.yaml:

dataset:
  path: data/fiqa
  beir_dataset: fiqa
  split: test

top_k: 5
rerank_top_n: 20
num_queries: 50

embedding_model: BAAI/bge-small-en-v1.5
reranker_model: BAAI/bge-reranker-base
gemini_model: gemini-2.5-flash

Adjust num_queries to control evaluation subset size. Set to 0 or a large value to use all queries.
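
One way the num_queries rule could be applied is sketched below. The helper name select_queries is hypothetical, and taking the first N queries (rather than a random sample) and treating negative values like 0 are both assumptions.

```python
def select_queries(query_ids, num_queries):
    """Return the evaluation subset implied by `num_queries`."""
    # 0 means "use everything"; treating negatives the same way is an assumption.
    # A value at least as large as the query set also selects everything.
    if num_queries <= 0 or num_queries >= len(query_ids):
        return list(query_ids)
    return list(query_ids[:num_queries])


all_ids = ["q1", "q2", "q3"]
subset = select_queries(all_ids, 2)
everything = select_queries(all_ids, 0)
```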

Evaluation Metrics

RAGAS Metrics

Metric	Description
Context Precision	Fraction of retrieved contexts that are relevant
Context Recall	Fraction of relevant contexts that were retrieved
Faithfulness	Whether the generated answer is grounded in retrieved context
Answer Relevancy	How relevant the generated answer is to the question
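
As a rough intuition for the two context metrics, the sketch below computes the plain set-based fractions described in the table, using known relevant document IDs such as BEIR qrels. This is a simplification and not how RAGAS computes its scores, which rely on LLM judgments; the function names are hypothetical.

```python
def context_precision(retrieved_ids, relevant_ids):
    # Fraction of retrieved contexts that are relevant.
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    # Fraction of relevant contexts that were retrieved.
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids) & set(relevant_ids))
    return hits / len(relevant_ids)


retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d2", "d5", "d9"}
precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
```

Here two of the five retrieved documents are relevant, and two of the three relevant documents were retrieved.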

Latency Metrics

Metric	Description
Retrieval Latency	Time spent retrieving documents only (seconds)
End-to-End Latency	Time from query to final answer (seconds)
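
The two latencies can be measured from a single start timestamp, as in this sketch. timed_pipeline, fake_retrieve, and fake_generate are hypothetical names; the fake functions stand in for a retriever and the Gemini call so the example runs offline.

```python
import time


def timed_pipeline(query, retrieve, generate):
    """Run retrieve -> generate and report both latencies in seconds."""
    start = time.perf_counter()
    docs = retrieve(query)
    retrieval_latency = time.perf_counter() - start
    answer = generate(query, docs)
    end_to_end_latency = time.perf_counter() - start
    return answer, retrieval_latency, end_to_end_latency


# Hypothetical stand-ins for a retriever and the generator.
def fake_retrieve(query):
    return ["doc about " + query]


def fake_generate(query, docs):
    return "answer based on " + str(len(docs)) + " document(s)"


answer, retrieval_s, total_s = timed_pipeline("index funds", fake_retrieve, fake_generate)
```

time.perf_counter is used because it is monotonic and intended for measuring short durations. End-to-end latency includes retrieval latency, so it is never smaller.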

Sample Results

After a successful run, find outputs in results/:

results/
├── bm25_results.csv
├── dense_results.csv
├── hybrid_results.csv
├── reranker_results.csv
├── leaderboard.csv
└── leaderboard.png

Example leaderboard (illustrative):

Method	Context Precision	Context Recall	Faithfulness	Answer Relevancy	Retrieval Latency	End-to-End Latency
Hybrid + Reranker	0.72	0.68	0.85	0.79	0.45	1.82
Hybrid	0.65	0.61	0.78	0.74	0.12	1.49
Dense	0.58	0.55	0.71	0.68	0.08	1.45
BM25	0.51	0.48	0.64	0.61	0.02	1.39

Actual scores depend on query subset, API latency, and model versions.
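
The sketch below shows one way per-pipeline summaries could be sorted into a leaderboard and serialized as CSV. The function names, the column subset, and sorting by Context Precision are assumptions; the numbers are copied from the illustrative table above and are not measured results.

```python
import csv
import io


def build_leaderboard(rows, sort_by="Context Precision"):
    """Sort per-pipeline summary rows (dicts) by one metric, best first."""
    return sorted(rows, key=lambda row: float(row[sort_by]), reverse=True)


def to_csv(rows):
    # Serialize rows to CSV text; column order follows the first row.
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative values only, taken from the example leaderboard.
rows = [
    {"Method": "BM25", "Context Precision": 0.51, "Faithfulness": 0.64},
    {"Method": "Hybrid + Reranker", "Context Precision": 0.72, "Faithfulness": 0.85},
    {"Method": "Dense", "Context Precision": 0.58, "Faithfulness": 0.71},
]
leaderboard = build_leaderboard(rows)
csv_text = to_csv(leaderboard)
```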

Future Work

Query expansion — Add HyDE or multi-query retrieval to improve recall
Larger query sets — Scale evaluation to the full FiQA test set
Additional datasets — Extend benchmark to BEIR NFCorpus, SciFact, or custom corpora
Chunking strategies — Compare fixed-size vs. semantic chunking for long documents
Cost tracking — Log Gemini token usage and API cost per pipeline
Statistical significance — Add bootstrap confidence intervals for metric comparisons
Ablation studies — Isolate impact of RRF parameter k, rerank candidate count, and embedding model choice

Project Structure

rag-beir-benchmark/
├── config.yaml
├── requirements.txt
├── README.md
├── utils.py
├── gemini_rate_limit.py
├── data/
├── ingestion/
│   ├── load_beir.py
│   └── embeddings.py
├── retrieval/
│   ├── bm25.py
│   ├── dense.py
│   ├── hybrid.py
│   └── reranker.py
├── generation/
│   └── gemini_generator.py
├── evaluation/
│   ├── ragas_eval.py
│   └── latency_eval.py
├── experiments/
│   └── run_all.py
├── visualization/
│   └── leaderboard.py
└── results/

License

This project is intended for research and portfolio use. BEIR datasets are subject to their respective licenses. Model weights are governed by their Hugging Face model cards.
