CodeMatrix1/Retrieval-Techniques

Comparative Evaluation of Retrieval Strategies for RAG

A reproducible benchmark that compares four retrieval strategies for Retrieval-Augmented Generation (RAG) on the BEIR FiQA-2018 dataset. The project scores each pipeline with RAGAS metrics and tracks retrieval and end-to-end latency.

Project Overview

This project evaluates how different retrieval methods affect RAG answer quality on financial question-answering data. Four pipelines are compared:

Method             Description
BM25               Lexical sparse retrieval
Dense              Embedding-based semantic retrieval (FAISS)
Hybrid             BM25 + Dense fused with Reciprocal Rank Fusion
Hybrid + Reranker  Hybrid retrieval with cross-encoder reranking

Each pipeline retrieves documents, generates answers with Gemini 2.5 Flash, and is scored with RAGAS metrics. Results are exported as CSV leaderboards and matplotlib visualizations.

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                        experiments/run_all.py                       │
│                     (Unified Experiment Runner)                     │
└─────────────────────────────────────────────────────────────────────┘
                                    │
          ┌─────────────────────────┼─────────────────────────┐
          ▼                         ▼                         ▼
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│   ingestion/     │    │   retrieval/     │    │   generation/    │
│  load_beir.py    │    │   bm25.py        │    │ gemini_generator │
│  embeddings.py   │    │   dense.py       │    │       .py        │
└──────────────────┘    │   hybrid.py      │    └──────────────────┘
          │             │   reranker.py    │              │
          │             └──────────────────┘              │
          │                         │                       │
          └─────────────────────────┼───────────────────────┘
                                    ▼
                    ┌───────────────────────────────┐
                    │         evaluation/           │
                    │  ragas_eval.py                │
                    │  latency_eval.py              │
                    └───────────────────────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────────┐
                    │       visualization/          │
                    │      leaderboard.py           │
                    └───────────────────────────────┘
                                    │
                                    ▼
                         results/leaderboard.csv
                         results/leaderboard.png

Retrieval Pipelines

Pipeline 1: BM25

Query → BM25 (rank-bm25) → Top-K Documents

Tokenizes corpus and queries, ranks documents by BM25 relevance scores.
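
The scoring step can be sketched in plain Python. This is a minimal illustration of a common Okapi BM25 variant, not the project's code: the project delegates scoring to rank-bm25, and the whitespace tokenizer, the k1 and b defaults, and the toy corpus below are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the project's tokenizer may differ.
    return text.lower().split()

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

def top_k(scores, k):
    """Return the indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy corpus (made up for illustration).
corpus = [
    "index funds track a market index",
    "bonds pay fixed interest over time",
    "an index fund has low fees",
]
corpus_tokens = [tokenize(doc) for doc in corpus]
scores = bm25_scores(tokenize("index fund fees"), corpus_tokens)
ranking = top_k(scores, k=2)
```

The third document matches all three query terms and ranks first. The second matches none and scores zero, which is the typical failure mode of purely lexical retrieval on paraphrased queries.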

Pipeline 2: Dense Retrieval

Query → BGE Embedding → FAISS Similarity Search → Top-K Documents

Encodes queries and documents with BAAI/bge-small-en-v1.5 and searches a FAISS inner-product index. Embeddings are L2-normalized, so inner-product scores equal cosine similarity.
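
A small sketch of why normalization matters for inner-product search. It uses plain Python lists as a stand-in for the FAISS index and made-up 3-dimensional vectors in place of real BGE embeddings (which have 384 dimensions). The function names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query_vec, doc_vecs, k):
    """Exhaustive inner-product search over normalized vectors, best first."""
    q = l2_normalize(query_vec)
    scored = [(inner_product(q, l2_normalize(d)), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy "embeddings" (made up). Document 0 has the largest magnitude,
# but document 1 points in exactly the same direction as the query.
doc_vecs = [[10.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
query_vec = [2.0, 2.0, 0.0]

hits = search(query_vec, doc_vecs, k=2)

# Without normalization, the raw inner product would favor document 0.
raw_scores = [inner_product(query_vec, d) for d in doc_vecs]
```

After normalization the ranking depends only on direction, so document 1 comes first with a score of 1.0, even though its raw inner product with the query is smaller than document 0's.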

Pipeline 3: Hybrid Retrieval

Query → BM25 Results ──┐
                       ├→ RRF Fusion → Top-K Documents
Query → Dense Results ─┘

Combines BM25 and dense ranked lists using standard Reciprocal Rank Fusion (RRF):

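score(d) = Σ_i 1 / (k + rank_i(d))

where i ranges over the ranked lists being fused (here BM25 and dense), rank_i(d) is the 1-based position of document d in list i, and k is a smoothing constant. A document missing from a list contributes nothing for that list.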
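
The fusion step is short enough to sketch directly. This is an illustrative implementation under assumptions: the function name, the document IDs, and the default k=60 (the value used in the original RRF paper) are not taken from the project's hybrid.py, which may choose differently.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_k=5):
    """Fuse several best-first rankings of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based; documents absent from a list add nothing for it.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, best first; break ties by document ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]

# Hypothetical rankings from the two retrievers.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = rrf_fuse([bm25_ranking, dense_ranking], k=60, top_k=3)
```

Here d1 (ranks 1 and 2) narrowly beats d3 (ranks 3 and 1), and both beat documents that appear in only one list. Because RRF uses ranks rather than raw scores, BM25 and cosine scores never need to be put on a common scale.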

Pipeline 4: Hybrid + Reranker

Query → Hybrid Retrieval → Top-20 Candidates → BGE Reranker → Top-K Documents

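Retrieves 20 candidates (rerank_top_n) with hybrid search, then reranks them with the BAAI/bge-reranker-base cross-encoder and keeps the top K. This pipeline is expected to give the best retrieval quality, at the cost of extra retrieval latency from the reranking pass.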
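
The two-stage pattern can be sketched with a pluggable scoring function. In the project the scorer is the cross-encoder. Here a word-overlap function stands in for it so the example runs without model weights. Both function names and the candidate texts are hypothetical.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Re-score each (query, document) pair and keep the top_k best documents."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def overlap_score(query, doc):
    # Hypothetical stand-in for the cross-encoder:
    # the fraction of query words that also appear in the document.
    q_words = set(query.lower().split())
    return len(q_words & set(doc.lower().split())) / len(q_words)

# Candidates as they might arrive from the hybrid stage (made up).
candidates = [
    "bonds pay fixed interest",
    "how to open a brokerage account",
    "a roth ira account grows tax free",
]

best = rerank("open roth ira account", candidates, overlap_score, top_k=2)
```

The scorer sees the query and document together, which is what separates a cross-encoder from the dense retriever's independent embeddings. It is also why reranking is applied only to a short candidate list: cost grows with one model call per candidate.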

Dataset Description

BEIR FiQA-2018 is a financial question-answering benchmark from the BEIR collection. It contains:

  • Financial forum posts and Stack Exchange questions as queries
  • Expert-annotated relevance judgments (qrels)
  • A corpus of financial documents

The dataset is automatically downloaded on first run and stored under data/fiqa/. The official BEIR test split is used. Query count is configurable via num_queries in config.yaml.

Installation

Prerequisites

  • Python 3.10+
  • Google Gemini API key

Setup

git clone <repository-url>
cd rag-beir-benchmark

python -m venv .venv
source .venv/bin/activate        # Linux/macOS
# .venv\Scripts\activate         # Windows

pip install -r requirements.txt

Set your Gemini API key:

export GOOGLE_API_KEY="your-api-key-here"   # Linux/macOS
# set GOOGLE_API_KEY=your-api-key-here      # Windows

Running Experiments

Run the full benchmark from the project root:

python experiments/run_all.py

This will:

  1. Download FiQA-2018 if missing
  2. Build BM25 and FAISS indexes (cached for subsequent runs)
  3. Run all four retrieval pipelines
  4. Generate answers with Gemini 2.5 Flash
  5. Evaluate with RAGAS metrics
  6. Save CSV results and leaderboard visualization

Configuration

All parameters are in config.yaml:

dataset:
  path: data/fiqa
  beir_dataset: fiqa
  split: test

top_k: 5
rerank_top_n: 20
num_queries: 50

embedding_model: BAAI/bge-small-en-v1.5
reranker_model: BAAI/bge-reranker-base
gemini_model: gemini-2.5-flash

Adjust num_queries to control evaluation subset size. Set to 0 or a large value to use all queries.
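
The num_queries rule can be sketched as follows. The function name is hypothetical, and taking the first N queries is an assumption: the project may sample the subset differently.

```python
def select_queries(query_ids, num_queries):
    """Return the evaluation subset implied by num_queries.

    0 (or any value at least as large as the query set) means "use all queries".
    """
    if num_queries <= 0 or num_queries >= len(query_ids):
        return list(query_ids)
    # Assumption: take the first N queries in order.
    return list(query_ids[:num_queries])

# Values mirroring the config.yaml excerpt above.
config = {"top_k": 5, "rerank_top_n": 20, "num_queries": 50}

# Hypothetical query IDs standing in for the test split.
query_ids = [f"q{i}" for i in range(120)]

subset = select_queries(query_ids, config["num_queries"])
everything = select_queries(query_ids, 0)
```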

Evaluation Metrics

RAGAS Metrics

Metric             Description
Context Precision  Fraction of retrieved contexts that are relevant
Context Recall     Fraction of relevant contexts that were retrieved
Faithfulness       Whether the generated answer is grounded in retrieved context
Answer Relevancy   How relevant the generated answer is to the question
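
As an intuition for the two context metrics, the fractions described above can be computed from document IDs when relevance labels are known. This is only a simplified illustration: RAGAS estimates these quantities with an LLM judge rather than by ID matching, so its reported numbers will not equal this set arithmetic. The function names and IDs are hypothetical.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant documents that were retrieved."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids) & set(relevant_ids))
    return hits / len(relevant_ids)

# Made-up example: 5 retrieved documents, 3 documents judged relevant.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d2", "d5", "d9"}

precision = context_precision(retrieved, relevant)  # 2 of 5 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
```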

Latency Metrics

Metric              Description
Retrieval Latency   Time spent retrieving documents only (seconds)
End-to-End Latency  Time from query to final answer (seconds)
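
One way to capture both latencies for a single query, sketched with the standard library. The function names, the record keys, and the sleeping stand-ins for the retriever and generator are assumptions for illustration, not the contents of latency_eval.py.

```python
import time

def run_query(query, retrieve, generate):
    """Run one query and record retrieval and end-to-end latency in seconds."""
    start = time.perf_counter()
    docs = retrieve(query)
    retrieval_s = time.perf_counter() - start
    answer = generate(query, docs)
    # End-to-end covers retrieval plus generation, measured from the same start.
    end_to_end_s = time.perf_counter() - start
    return {
        "answer": answer,
        "retrieval_latency_s": retrieval_s,
        "end_to_end_latency_s": end_to_end_s,
    }

def fake_retrieve(query):
    # Hypothetical stand-in for a retriever.
    time.sleep(0.01)
    return ["doc about " + query]

def fake_generate(query, docs):
    # Hypothetical stand-in for the Gemini call.
    time.sleep(0.02)
    return f"answer based on {len(docs)} document(s)"

record = run_query("index funds", fake_retrieve, fake_generate)
```

time.perf_counter is a monotonic, high-resolution clock, which makes it a better fit for short intervals than time.time. End-to-end latency in a real run is dominated by the generation API call, so it varies with network conditions.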

Sample Results

After a successful run, find outputs in results/:

results/
├── bm25_results.csv
├── dense_results.csv
├── hybrid_results.csv
├── reranker_results.csv
├── leaderboard.csv
└── leaderboard.png

Example leaderboard (illustrative):

Method             Context Precision  Context Recall  Faithfulness  Answer Relevancy  Retrieval Latency (s)  End-to-End Latency (s)
Hybrid + Reranker  0.72               0.68            0.85          0.79              0.45                   1.82
Hybrid             0.65               0.61            0.78          0.74              0.12                   1.49
Dense              0.58               0.55            0.71          0.68              0.08                   1.45
BM25               0.51               0.48            0.64          0.61              0.02                   1.39

Actual scores depend on query subset, API latency, and model versions.

Future Work

  • Query expansion — Add HyDE or multi-query retrieval to improve recall
  • Larger query sets — Scale evaluation to the full FiQA test set
  • Additional datasets — Extend benchmark to BEIR NFCorpus, SciFact, or custom corpora
  • Chunking strategies — Compare fixed-size vs. semantic chunking for long documents
  • Cost tracking — Log Gemini token usage and API cost per pipeline
  • Statistical significance — Add bootstrap confidence intervals for metric comparisons
  • Ablation studies — Isolate impact of RRF parameter k, rerank candidate count, and embedding model choice
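
For the statistical significance item, a percentile bootstrap over per-query scores might look like the sketch below. Nothing here exists in the project yet: the function name, the resample count, and the sample scores are all assumptions.

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-query scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample the queries with replacement and record the mean each time.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up per-query faithfulness scores for one pipeline.
scores = [0.9, 0.8, 1.0, 0.6, 0.7, 0.95, 0.85, 0.5, 1.0, 0.75]

lower, upper = bootstrap_ci(scores)
```

With only 50 queries per run, intervals like this are likely to be wide. Overlapping intervals between two pipelines would suggest the leaderboard gap may not be meaningful at that sample size.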

Project Structure

rag-beir-benchmark/
├── config.yaml
├── requirements.txt
├── README.md
├── utils.py
├── data/
├── ingestion/
│   ├── load_beir.py
│   └── embeddings.py
├── retrieval/
│   ├── bm25.py
│   ├── dense.py
│   ├── hybrid.py
│   └── reranker.py
├── generation/
│   └── gemini_generator.py
├── evaluation/
│   ├── ragas_eval.py
│   └── latency_eval.py
├── experiments/
│   └── run_all.py
├── visualization/
│   └── leaderboard.py
└── results/

License

This project is intended for research and portfolio use. BEIR datasets are subject to their respective licenses. Model weights are governed by their Hugging Face model cards.
