A reproducible machine learning benchmark that compares four retrieval strategies for Retrieval-Augmented Generation (RAG) on the BEIR FiQA-2018 dataset. The project measures retrieval and answer quality with RAGAS metrics and tracks retrieval and end-to-end latency for each pipeline.
This project evaluates how different retrieval methods affect RAG answer quality on financial question-answering data. Four pipelines are compared:
| Method | Description |
|---|---|
| BM25 | Lexical sparse retrieval |
| Dense | Embedding-based semantic retrieval (FAISS) |
| Hybrid | BM25 + Dense fused with Reciprocal Rank Fusion |
| Hybrid + Reranker | Hybrid retrieval with cross-encoder reranking |
Each pipeline retrieves documents, generates answers with Gemini 2.5 Flash, and is scored with RAGAS metrics. Results are exported as CSV leaderboards and matplotlib visualizations.
┌──────────────────────────────────────────────────────────────────────┐
│                        experiments/run_all.py                        │
│                     (Unified Experiment Runner)                      │
└──────────────────────────────────────────────────────────────────────┘
                                   │
         ┌─────────────────────────┼─────────────────────────┐
         ▼                         ▼                         ▼
┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│ ingestion/       │      │ retrieval/       │      │ generation/      │
│ load_beir.py     │      │ bm25.py          │      │ gemini_generator │
│ embeddings.py    │      │ dense.py         │      │ .py              │
└──────────────────┘      │ hybrid.py        │      └──────────────────┘
         │                │ reranker.py      │               │
         │                └──────────────────┘               │
         │                         │                         │
         └─────────────────────────┼─────────────────────────┘
                                   ▼
                   ┌───────────────────────────────┐
                   │ evaluation/                   │
                   │ ragas_eval.py                 │
                   │ latency_eval.py               │
                   └───────────────────────────────┘
                                   │
                                   ▼
                   ┌───────────────────────────────┐
                   │ visualization/                │
                   │ leaderboard.py                │
                   └───────────────────────────────┘
                                   │
                                   ▼
                        results/leaderboard.csv
                        results/leaderboard.png
Query → BM25 (rank-bm25) → Top-K Documents
Tokenizes corpus and queries, ranks documents by BM25 relevance scores.
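To make the scoring step concrete, here is a minimal standard-library sketch of Okapi BM25 ranking. It is not the rank-bm25 implementation the project uses: the whitespace tokenizer, the `k1` and `b` values, the smoothed IDF variant, and the toy corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the project's tokenizer may differ.
    return text.lower().split()

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF (the "+ 1.0" keeps it positive); libraries differ on this detail.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

def top_k_indices(scores, k):
    # Indices of the k highest-scoring documents, best first.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy corpus (illustrative, not FiQA data).
corpus = [
    "index funds track a market index",
    "a roth ira grows tax free",
    "bond prices fall when interest rates rise",
]
corpus_tokens = [tokenize(doc) for doc in corpus]
scores = bm25_scores(tokenize("roth ira tax"), corpus_tokens)
ranked = top_k_indices(scores, k=2)
print(ranked, scores)
```

Only the second document shares terms with the query, so it is the only one with a non-zero score and it ranks first.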
Query → BGE Embedding → FAISS Similarity Search → Top-K Documents
Encodes queries and documents with BAAI/bge-small-en-v1.5, searches a FAISS inner-product index with normalized embeddings.
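The reason for normalizing embeddings is that the inner product of unit-length vectors equals their cosine similarity. The sketch below shows this with an exhaustive search in plain Python. The 2-dimensional vectors are toy stand-ins for BGE embeddings, and the `search` helper is a hypothetical substitute for a FAISS index, not the project's code.

```python
import math

def l2_normalize(vec):
    # Scale the vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query_vec, doc_vecs, k):
    """Exhaustive inner-product search over normalized vectors.

    With unit-length inputs the inner product is the cosine similarity.
    """
    q = l2_normalize(query_vec)
    scored = [(inner_product(q, l2_normalize(d)), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]

# Toy embeddings (illustrative only).
doc_vecs = [[1.0, 0.0], [3.0, 3.0], [0.0, -2.0]]
hits = search([2.0, 2.0], doc_vecs, k=2)
print(hits)
```

The second document points in the same direction as the query, so its score is about 1.0 regardless of vector length.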
Query → BM25 Results ──┐
                       ├→ RRF Fusion → Top-K Documents
Query → Dense Results ─┘
Combines BM25 and dense ranked lists using standard Reciprocal Rank Fusion (RRF):
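score(d) = Σ_r 1 / (k + rank_r(d))
Here the sum runs over the ranked lists r (BM25 and dense), rank_r(d) is the 1-based position of document d in list r, and k is a smoothing constant.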
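The fusion rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's `hybrid.py`: the function name `rrf_fuse` is hypothetical, `k=60` is a commonly used default rather than a value stated here, and documents missing from a list simply contribute nothing for that list.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_k=5):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the best document in a list adds 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_k]

# Hypothetical document IDs for illustration.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d5", "d3"]
fused = rrf_fuse([bm25_ranking, dense_ranking], k=60, top_k=3)
print([doc_id for doc_id, _ in fused])
```

Documents that appear in both lists (`d1`, `d3`) accumulate two terms and rise above documents found by only one retriever.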
Query → Hybrid Retrieval → Top-20 Candidates → BGE Reranker → Top-K Documents
Retrieves candidates with hybrid search, then reranks them with the BAAI/bge-reranker-base cross-encoder. This pipeline is expected to score highest on quality, at the cost of extra retrieval latency from the cross-encoder pass.
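The retrieve-then-rerank pattern looks roughly like the sketch below. The scoring function is a hypothetical word-overlap stand-in so the example runs without model weights; in the real pipeline that role is played by the cross-encoder, which scores each (query, document) pair jointly. The candidate texts are made up.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Score each (query, document) pair and keep the top_k best documents."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def toy_overlap_score(query, doc):
    # Hypothetical stand-in for a cross-encoder: fraction of query words found in the doc.
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

# In the benchmark this list would hold up to rerank_top_n hybrid candidates.
candidates = [
    "bonds pay fixed interest",
    "dividends are taxed as income",
    "how are brokerage accounts opened",
    "qualified dividends get lower rates",
]
reranked = rerank("how are dividends taxed", candidates, toy_overlap_score, top_k=2)
print(reranked)
```

The first-stage retriever only needs good recall over the candidate pool; the more expensive scorer then decides the final order of a short list.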
BEIR FiQA-2018 is a financial question-answering benchmark from the BEIR collection. It contains:
- Financial forum posts and Stack Exchange questions as queries
- Expert-annotated relevance judgments (qrels)
- A corpus of financial documents
The dataset is automatically downloaded on first run and stored under data/fiqa/. The official BEIR test split is used. Query count is configurable via num_queries in config.yaml.
- Python 3.10+
- Google Gemini API key
git clone <repository-url>
cd rag-beir-benchmark
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
pip install -r requirements.txt
Set your Gemini API key:
export GOOGLE_API_KEY="your-api-key-here" # Linux/macOS
# set GOOGLE_API_KEY=your-api-key-here # Windows
Run the full benchmark from the project root:
python experiments/run_all.py
This will:
- Download FiQA-2018 if missing
- Build BM25 and FAISS indexes (cached for subsequent runs)
- Run all four retrieval pipelines
- Generate answers with Gemini 2.5 Flash
- Evaluate with RAGAS metrics
- Save CSV results and leaderboard visualization
All parameters are in config.yaml:
dataset:
  path: data/fiqa
  beir_dataset: fiqa
  split: test
top_k: 5
rerank_top_n: 20
num_queries: 50
embedding_model: BAAI/bge-small-en-v1.5
reranker_model: BAAI/bge-reranker-base
gemini_model: gemini-2.5-flash
Adjust num_queries to control the size of the evaluation subset. Set it to 0 or a large value to use all queries.
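One way to read the `num_queries` rule is sketched below. The helper name `select_queries` is hypothetical, and taking the first N queries in order is an assumption; the project may sample differently.

```python
def select_queries(query_ids, num_queries):
    """Return the evaluation subset; 0 or an oversized value keeps every query."""
    if num_queries <= 0 or num_queries >= len(query_ids):
        return list(query_ids)
    # Assumption: keep the first num_queries entries in their original order.
    return list(query_ids[:num_queries])

all_ids = ["q1", "q2", "q3", "q4"]
print(select_queries(all_ids, 2))
print(select_queries(all_ids, 0))
print(select_queries(all_ids, 50))
```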
| Metric | Description |
|---|---|
| Context Precision | Fraction of retrieved contexts that are relevant |
| Context Recall | Fraction of relevant contexts that were retrieved |
| Faithfulness | Whether the generated answer is grounded in retrieved context |
| Answer Relevancy | How relevant the generated answer is to the question |
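
For intuition about the two context metrics, the sketch below computes a simplified, set-based version of each from document IDs. This is only an illustration of the table's descriptions: RAGAS itself uses LLM-judged and rank-aware formulations, so its numbers will not match these. The function names and IDs are hypothetical.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Simplified: share of retrieved documents that are relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Simplified: share of relevant documents that were retrieved."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical retrieval result and relevance judgments for one query.
retrieved = ["d1", "d4", "d9", "d2", "d7"]
relevant = {"d1", "d2", "d3"}
precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(precision, recall)
```
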
| Metric | Description |
|---|---|
| Retrieval Latency | Time spent retrieving documents only (seconds) |
| End-to-End Latency | Time from query to final answer (seconds) |
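
The two latency figures can be captured with a single monotonic clock, as in the sketch below. The `retrieve` and `generate` callables are hypothetical stubs standing in for a retrieval pipeline and the Gemini call; `latency_eval.py` may be organized differently.

```python
import time

def run_query(query, retrieve, generate):
    """Run one query and record retrieval-only and end-to-end latency in seconds."""
    start = time.perf_counter()
    docs = retrieve(query)
    retrieval_latency = time.perf_counter() - start
    answer = generate(query, docs)
    end_to_end_latency = time.perf_counter() - start
    return {
        "answer": answer,
        "retrieval_latency": retrieval_latency,
        "end_to_end_latency": end_to_end_latency,
    }

# Stubs for illustration only.
def fake_retrieve(query):
    return ["doc about " + query]

def fake_generate(query, docs):
    return "answer based on {} document(s)".format(len(docs))

record = run_query("etf fees", fake_retrieve, fake_generate)
print(record)
```

Because both timings share one start point, end-to-end latency always includes the retrieval time plus generation and any overhead in between.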
After a successful run, find outputs in results/:
results/
├── bm25_results.csv
├── dense_results.csv
├── hybrid_results.csv
├── reranker_results.csv
├── leaderboard.csv
└── leaderboard.png
Example leaderboard (illustrative):
| Method | Context Precision | Context Recall | Faithfulness | Answer Relevancy | Retrieval Latency (s) | End-to-End Latency (s) |
|---|---|---|---|---|---|---|
| Hybrid + Reranker | 0.72 | 0.68 | 0.85 | 0.79 | 0.45 | 1.82 |
| Hybrid | 0.65 | 0.61 | 0.78 | 0.74 | 0.12 | 1.49 |
| Dense | 0.58 | 0.55 | 0.71 | 0.68 | 0.08 | 1.45 |
| BM25 | 0.51 | 0.48 | 0.64 | 0.61 | 0.02 | 1.39 |
Actual scores depend on query subset, API latency, and model versions.
- Query expansion: add HyDE or multi-query retrieval to improve recall
- Larger query sets: scale evaluation to the full FiQA test set
- Additional datasets: extend the benchmark to BEIR NFCorpus, SciFact, or custom corpora
- Chunking strategies: compare fixed-size vs. semantic chunking for long documents
- Cost tracking: log Gemini token usage and API cost per pipeline
- Statistical significance: add bootstrap confidence intervals for metric comparisons
- Ablation studies: isolate the impact of the RRF parameter k, rerank candidate count, and embedding model choice
rag-beir-benchmark/
├── config.yaml
├── requirements.txt
├── README.md
├── utils.py
├── data/
├── ingestion/
│ ├── load_beir.py
│ └── embeddings.py
├── retrieval/
│ ├── bm25.py
│ ├── dense.py
│ ├── hybrid.py
│ └── reranker.py
├── generation/
│ └── gemini_generator.py
├── evaluation/
│ ├── ragas_eval.py
│ └── latency_eval.py
├── experiments/
│ └── run_all.py
├── visualization/
│ └── leaderboard.py
└── results/
This project is intended for research and portfolio use. BEIR datasets are subject to their respective licenses. Model weights are governed by their Hugging Face model cards.