Intro

This notebook explores modern Information Retrieval with the MS MARCO dataset, ranking passages with BM25 and FAISS, refining with neural reranking, and comparing pipelines using PyTerrier and evaluation metrics.

Environment Setup

Java Installation for PyTerrier

  • Download and install OpenJDK
openjdk 17.0.8 2023-07-18
OpenJDK Runtime Environment Temurin-17.0.8+7 (build 17.0.8+7)
OpenJDK 64-Bit Server VM Temurin-17.0.8+7 (build 17.0.8+7, mixed mode, sharing)
javac 17.0.8
  • Test PyTerrier works
python - << 'EOF'
import pyterrier as pt
pt.init()
print("PyTerrier initialized successfully!")
EOF

Set up the virtual environment in scratch

# create a directory to store the environment
mkdir -p ~/scratch/dsgt-arc/ir
 
# create a virtual environment using uv
uv venv ~/scratch/dsgt-arc/ir/.venv
  • Create an alias in ~/.bashrc
alias activate='source ~/scratch/dsgt-arc/ir/.venv/bin/activate'
  • Activate it
# Reload bashrc
source ~/.bashrc
# Activate the environment
activate
# verify
which python
  • Link the scratch .venv with the ir project .venv and install dependencies using pyproject.toml
  • To delete the .venv
# leave the venv
deactivate
 
# delete it completely
rm -rf /storage/scratch1/1/ctio3/dsgt-arc/ir/.venv

Running the Jupyter Notebook

  • Register Jupyter kernel
python -m ipykernel install --user --name ir --display-name "DSGT ir (scratch)"
# Verify
jupyter kernelspec list
  • Use SSH port forwarding to expose the remote Jupyter server to the local browser
#  One-liner to SSH into compute node, forward port 8889 to local machine, activate the `.venv`, and launch Jupyter
ssh -L 8889:localhost:8889 pace-interactive 'source ~/scratch/dsgt-arc/ir/.venv/bin/activate && jupyter lab --no-browser --port=8889'

Optional: Create a Jupyter start script

  • Create a start-pace-jupyter.sh script
  • Run the script
./start-pace-jupyter.sh

Other

  • List devices connected to PCI (Peripheral Component Interconnect) buses
lspci
...
d7:16.4 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 07)
d7:16.5 Performance counters: Intel Corporation Sky Lake-E DDRIO Registers (rev 07)
d7:17.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 07)
d7:17.1 Performance counters: Intel Corporation Sky Lake-E DDRIO Registers (rev 07)
d8:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
d8:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
d8:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
d8:00.3 Serial bus controller: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)

Information Retrieval: Dense Retrieval with FAISS and Neural Reranking

The Two-Stage Retrieval Pipeline

Stage 1: Fast candidate retrieval (FAISS/BM25) → top-100 results
Stage 2: Accurate reranking (CrossEncoder) → final top-10
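
Before the exercises, here is a compact, self-contained sketch of this retrieve-then-rerank pattern with sentence-transformers and FAISS. It is illustrative only: the toy corpus, the `cross-encoder/ms-marco-MiniLM-L-6-v2` reranker, and the variable names are assumptions, not the notebook's exact components.

# Minimal two-stage retrieve-then-rerank sketch (illustrative, not the exact pipeline built below)
import faiss
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

toy_passages = [
    "Airplanes fly because of lift generated by air flowing over wings.",
    "Pizza is a popular Italian food with cheese and tomato sauce.",
]
query = "how do airplanes fly"

# Stage 1: bi-encoder embeddings + FAISS for fast candidate retrieval
encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
p_emb = encoder.encode(toy_passages, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(p_emb.shape[1])  # inner product == cosine on unit vectors
index.add(p_emb)
q_emb = encoder.encode([query], normalize_embeddings=True).astype("float32")
scores, ids = index.search(q_emb, min(100, len(toy_passages)))  # top-100 candidates

# Stage 2: cross-encoder scores each (query, passage) pair and reranks the candidates
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, toy_passages[i]) for i in ids[0]]
rerank_scores = reranker.predict(pairs)
top10 = [toy_passages[ids[0][j]] for j in np.argsort(-rerank_scores)[:10]]
print(top10)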

Exercises

1. Try different embedding models (larger = better quality, slower)

Load the larger all-mpnet-base-v2 model

# Load sentence transformer model for embeddings
# all-mpnet-base-v2 is a general-purpose sentence-embedding model, larger than the base model used earlier in the notebook
model_name = "sentence-transformers/all-mpnet-base-v2"
print(f"Loading embedding model: {model_name}")
print("This may take a minute on first run...\n")
 
# Load model
embedding_model = SentenceTransformer(model_name, device=device)
 
print("Model loaded successfully!")
print(f"\n{'=' * 60}")
print("MODEL INFORMATION")
print(f"{'=' * 60}")
 
# Model info
print(f"Model: {model_name}")
print(f"Embedding Dimension: {embedding_model.get_sentence_embedding_dimension()}")
print(f"Max Sequence Length: {embedding_model.max_seq_length} tokens")
print(f"Device: {embedding_model.device}")
 
# Count parameters
total_params = sum(p.numel() for p in embedding_model.parameters())
print(f"Total Parameters: {total_params:,}")
print(f"Model Size: ~{total_params * 4 / (1024**2):.1f} MB")
 
print(f"{'=' * 60}\n")
============================================================
MODEL INFORMATION
============================================================
Model: sentence-transformers/all-mpnet-base-v2
Embedding Dimension: 768
Max Sequence Length: 384 tokens
Device: cuda:0
Total Parameters: 109,486,464
Model Size: ~417.7 MB
============================================================
# Test the embedding model with sample texts
test_query = "how do airplanes fly"
test_passages = [
    "Airplanes fly because of lift generated by air flowing over wings.",
    "The principles of aerodynamics explain how aircraft achieve flight.",
    "Pizza is a popular Italian food with cheese and tomato sauce.",
    "Python is a programming language used for data science.",
]
 
print(f"Test Query: '{test_query}'\n")
 
# Encode
query_emb = embedding_model.encode([test_query], convert_to_tensor=True)
passage_embs = embedding_model.encode(test_passages, convert_to_tensor=True)
 
# Compute similarities
similarities = torch.nn.functional.cosine_similarity(query_emb, passage_embs)
 
print("Similarity Scores (higher = more relevant):\n")
for i, (passage, score) in enumerate(
    sorted(zip(test_passages, similarities), key=lambda x: x[1], reverse=True), 1
):
    bar = "█" * int(score * 50)  # one block per 0.02 of similarity
    print(f"{i}. Score: {score:.4f} {bar}")
    print(f"   {passage}\n")
 
print(
    "Notice: Semantically related passages get higher scores, even without word overlap!"
)
Test Query: 'how do airplanes fly'

Similarity Scores (higher = more relevant):

1. Score: 0.8020 ████████████████████████████████████████
   Airplanes fly because of lift generated by air flowing over wings.

2. Score: 0.7006 ███████████████████████████████████
   The principles of aerodynamics explain how aircraft achieve flight.

3. Score: 0.0608 ███
   Python is a programming language used for data science.

4. Score: 0.0376 █
   Pizza is a popular Italian food with cheese and tomato sauce.

Notice: Semantically related passages get higher scores, even without word overlap!
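
To make the "without word overlap" point concrete, here is a tiny lexical-overlap check between the query and the two top-scoring passages. This cell is not part of the original notebook; it only reuses `test_query` and `test_passages` from above.

# Compare lexical overlap between the query and the two top-scoring passages
query_terms = set(test_query.lower().split())
for passage in test_passages[:2]:
    passage_terms = set(passage.lower().rstrip(".").split())
    overlap = query_terms & passage_terms
    print(f"Overlap {sorted(overlap)!r}: {passage}")
# The aerodynamics passage shares only "how" with the query,
# yet still scores ~0.70 on embedding similarity above.
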
# Encode all passages into embeddings
# This is the computationally expensive part! It can take several minutes on a GPU and much longer on CPU...
print(f"Encoding {len(passages):,} passages...")
 
batch_size = 32  # Adjust based on your GPU memory
 
# Create passage embeddings
passage_embeddings = embedding_model.encode(
    passages,
    batch_size=batch_size,
    show_progress_bar=True,
    convert_to_tensor=False,  # Return numpy for FAISS
    normalize_embeddings=True,  # Normalize for cosine similarity
)
print("\nPassage embeddings created!")
print(f"Shape: {passage_embeddings.shape}")
Encoding 82,193 passages...
Batches: 100% 2569/2569 [10:25<00:00, 13.30it/s]

Passage embeddings created!
Shape: (82193, 768)
# Encode all queries into embeddings
print(f"Encoding {len(queries):,} queries...")
 
query_embeddings = embedding_model.encode(
    queries,
    batch_size=batch_size,
    show_progress_bar=True,
    convert_to_tensor=False,
    normalize_embeddings=True,
)
print("\nQuery embeddings created!")
print(f"Shape: {query_embeddings.shape}")
Encoding 500 queries...
Batches: 100% 16/16 [00:00<00:00, 22.96it/s]

Query embeddings created!
Shape: (500, 768)

Rebuild FAISS index

print("Rebuilding FAISS index...")
 
# Get embedding dimension
embedding_dim = passage_embeddings.shape[1]
 
# Create FAISS index
# IndexFlatIP = Flat index with Inner Product (cosine similarity for normalized vectors)
index = faiss.IndexFlatIP(embedding_dim)
 
# Add passage embeddings to index
index.add(passage_embeddings.astype("float32"))
 
print("FAISS index built!")
print(f"\n{'=' * 60}")
print("INDEX INFORMATION")
print(f"{'=' * 60}")
print("Index Type: Flat (Exact Search)")
print("Similarity Metric: Inner Product (Cosine for normalized vectors)")
print(f"Embedding Dimension: {embedding_dim}")
print(f"Number of Vectors: {index.ntotal:,}")
print(f"Index Size: ~{passage_embeddings.nbytes / (1024**2):.1f} MB")
print(f"Is Trained: {index.is_trained}")
print(f"{'=' * 60}\n")
print("The index is now ready for lightning-fast retrieval!")
Building FAISS index...
FAISS index built!

============================================================
INDEX INFORMATION
============================================================
Index Type: Flat (Exact Search)
Similarity Metric: Inner Product (Cosine for normalized vectors)
Embedding Dimension: 768
Number of Vectors: 82,193
Index Size: ~240.8 MB
Is Trained: True
============================================================

The index is now ready for lightning-fast retrieval!
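
Two details worth spelling out: for L2-normalized embeddings the inner product is exactly the cosine similarity, and the reported index size is simply vectors × dimensions × 4 bytes of float32. A quick standalone check (numpy only, not part of the original notebook):

import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=768), rng.normal(size=768)

# Cosine similarity of the raw vectors...
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# ...equals the plain inner product once both vectors are L2-normalized,
# which is what IndexFlatIP computes over normalized embeddings.
inner = (a / np.linalg.norm(a)) @ (b / np.linalg.norm(b))
assert np.isclose(cosine, inner)

# Index memory: 82,193 vectors x 768 dims x 4 bytes (float32)
print(f"{82_193 * 768 * 4 / 1024**2:.1f} MB")  # 240.8 MB, matching the report above
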
# Perform first-stage retrieval
K = 100  # Retrieve top-100 passages per query
 
print("Performing first-stage retrieval...")
print(f"Retrieving top-{K} passages for {len(queries):,} queries...\n")
 
# Search
scores, indices = index.search(query_embeddings.astype("float32"), K)
 
print("Retrieval complete!")
print(f"Results shape: {indices.shape}")
print(f"Scores shape: {scores.shape}")
print(f"Total retrievals: {indices.shape[0] * indices.shape[1]:,}")
Performing first-stage retrieval...
Retrieving top-100 passages for 500 queries...

Retrieval complete!
Results shape: (500, 100)
Scores shape: (500, 100)
Total retrievals: 50,000
# Show sample retrieval results
num_examples = 3
sample_query_indices = random.sample(range(len(queries)), num_examples)
 
for example_num, qidx in enumerate(sample_query_indices, 1):
    print("=" * 80)
    print(f"EXAMPLE {example_num}")
    print("=" * 80)
    print(f"\nQuery: {queries[qidx]}\n")
    print("Top-5 Retrieved Passages:\n")
 
    for rank, (score, pidx) in enumerate(zip(scores[qidx][:5], indices[qidx][:5]), 1):
        passage_text = passages[pidx]
        # Truncate long passages
        if len(passage_text) > 150:
            passage_text = passage_text[:150] + "..."
        print(f"   Rank {rank} | Score: {score:.4f}")
        print(f"   {passage_text}\n")
================================================================================
EXAMPLE 1
================================================================================

Query: what kind of organism is a black damsel

Top-5 Retrieved Passages:

   Rank 1 | Score: 0.6407
   Size and appearance. The largest scientifically measured Striped Damsel was 10.0 cm / 3.9 in. The Striped Damsel is white and adorned with three black...

   Rank 2 | Score: 0.5679
   Damselfishes comprise the family Pomacentridae except those of the genera Amphiprion and Premnas, which are the anemonefishes. They can grow up to 14 ...

   Rank 3 | Score: 0.5603
   Known by multiple common names, such as humbug damselfish, three-striped damselfish and white-tailed damselfish, Dascyllus aruanus is a feisty little ...

   Rank 4 | Score: 0.5486
   The 3-Stripe Damselfish, also known as the Three Striped Damselfish, White-tailed Damselfish, and Humbug Dascyllus, is a popular fish. Three bold blac...

   Rank 5 | Score: 0.5313
   1 Yellowtail Damsel Fish-The most popular damsel fish available, these naturally resilient fish are comparatively peaceful if kept in a shoal in a tan...

================================================================================
EXAMPLE 2
================================================================================
...
# Analyze score distributions
all_scores = scores.flatten()
 
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
 
# Score distribution
axes[0].hist(all_scores, bins=50, color="steelblue", alpha=0.7, edgecolor="black")
axes[0].set_xlabel("Similarity Score", fontsize=12)
axes[0].set_ylabel("Frequency", fontsize=12)
axes[0].set_title("Distribution of Retrieval Scores", fontsize=14, fontweight="bold")
axes[0].axvline(
    np.mean(all_scores),
    color="red",
    linestyle="--",
    linewidth=2,
    label=f"Mean: {np.mean(all_scores):.3f}",
)
axes[0].legend()
axes[0].grid(axis="y", alpha=0.3)
 
# Top-1 scores vs rank
top_scores = scores[:, 0]  # Best score for each query
axes[1].hist(top_scores, bins=50, color="coral", alpha=0.7, edgecolor="black")
axes[1].set_xlabel("Top-1 Similarity Score", fontsize=12)
axes[1].set_ylabel("Frequency", fontsize=12)
axes[1].set_title("Distribution of Best Match Scores", fontsize=14, fontweight="bold")
axes[1].axvline(
    np.mean(top_scores),
    color="red",
    linestyle="--",
    linewidth=2,
    label=f"Mean: {np.mean(top_scores):.3f}",
)
axes[1].legend()
axes[1].grid(axis="y", alpha=0.3)
 
plt.tight_layout()
plt.show()
 
print("\nRetrieval Statistics:")
print(f"Average similarity score: {np.mean(all_scores):.4f}")
print(f"Average top-1 score: {np.mean(top_scores):.4f}")
print(f"Min score: {np.min(all_scores):.4f}")
print(f"Max score: {np.max(all_scores):.4f}")
Retrieval Statistics:
Average similarity score: 0.4298
Average top-1 score: 0.7800
Min score: 0.2015
Max score: 0.9398

Recreate FAISS retriever and re-run pipeline definition
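
The `FAISSRetriever` class, the `pt_reranker` CrossEncoder wrapper, and the `score_normalizer` transformer used below are defined earlier in the notebook. For orientation, a minimal sketch of what such a FAISS-backed PyTerrier transformer can look like follows; the class name, column handling, and method body are illustrative assumptions, not the notebook's exact implementation.

import pandas as pd
import pyterrier as pt


class FAISSRetrieverSketch(pt.Transformer):
    """Illustrative FAISS-backed retriever; not the notebook's exact class."""

    def __init__(self, index, passages, passage_ids, embedding_model, k=100):
        self.index = index
        self.passages = passages
        self.passage_ids = passage_ids
        self.embedding_model = embedding_model
        self.k = k

    def transform(self, topics: pd.DataFrame) -> pd.DataFrame:
        # Encode the incoming queries, search the FAISS index, and return
        # the standard PyTerrier result columns: qid, query, docno, score, rank.
        q_emb = self.embedding_model.encode(
            topics["query"].tolist(), normalize_embeddings=True
        ).astype("float32")
        scores, idxs = self.index.search(q_emb, self.k)
        rows = []
        for (_, topic), q_scores, q_idxs in zip(topics.iterrows(), scores, idxs):
            for rank, (score, pidx) in enumerate(zip(q_scores, q_idxs)):
                rows.append({
                    "qid": topic["qid"],
                    "query": topic["query"],
                    "docno": self.passage_ids[pidx],
                    "score": float(score),
                    "rank": rank,
                })
        return pd.DataFrame(rows)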

# Recreate FAISS retriever
faiss_pt = FAISSRetriever(
    index=index,
    passages=passages,
    passage_ids=passage_ids,
    embedding_model=embedding_model,
    k=100,
)
 
print("\nFAISS PyTerrier wrapper recreated!")
print("Will retrieve top-100 results per query")
print("Can now be used in PyTerrier pipelines")
 
 
 
# Rebuild retrieval pipelines
print("=" * 70)
print("Updating Retrieval Pipelines")
print("=" * 70)
 
# PyTerrier pipeline operators: `%` cuts a result set to the top k,
# `>>` chains transformers ("then"), and `+` sums the scores of two result sets.
pipeline_sparse = bm25 % 100
pipeline_dense = faiss_pt
pipeline_sparse_rerank = (bm25 % 100) >> pt_reranker
pipeline_dense_rerank = faiss_pt >> pt_reranker
pipeline_hybrid_rerank = (
    ((bm25 % 100) >> score_normalizer) + (faiss_pt >> score_normalizer)
) >> pt_reranker
 
pipelines = {
    "1. Sparse (BM25)": pipeline_sparse,
    "2. Dense (FAISS)": pipeline_dense,
    "3. Sparse + Rerank": pipeline_sparse_rerank,
    "4. Dense + Rerank": pipeline_dense_rerank,
    "5. Hybrid + Rerank": pipeline_hybrid_rerank,
}
 
print("\nAll pipelines Rebuilt!\n")
 
 
 
# Run all pipelines and collect results
print("=" * 70)
print("Running All Pipelines")
print("=" * 70)
 
print(f"\nTesting on {len(test_queries_df)} queries\n")
 
# Run each pipeline and store results
results_dict = {}
 
for name, pipeline in pipelines.items():
    print(f"Running: {name}")
    # Transform queries through pipeline
    results = pipeline.transform(test_queries_df)
    results_dict[name] = results
    print("Completed!")
 
print("=" * 70)
print("All pipelines executed successfully!")
print("=" * 70)
FAISS PyTerrier wrapper recreated!
Will retrieve top-100 results per query
Can now be used in PyTerrier pipelines
======================================================================
Updating Retrieval Pipelines
======================================================================

All pipelines Rebuilt!

======================================================================
Running All Pipelines
======================================================================

Testing on 100 queries

Running: 1. Sparse (BM25)
Completed!
Running: 2. Dense (FAISS)
Completed!
Running: 3. Sparse + Rerank
Completed!
Running: 4. Dense + Rerank
Completed!
Running: 5. Hybrid + Rerank
Completed!
======================================================================
All pipelines executed successfully!
======================================================================

Rerun evaluation
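
The metric helpers called below (`calculate_mrr_with_qrels`, `calculate_recall_at_k_with_qrels`, `calculate_precision_at_k_with_qrels`, `prepare_retrieved_for_eval`) are defined earlier in the notebook. As a refresher, MRR@k averages the reciprocal rank of the first relevant passage within the top k. A minimal sketch, assuming `qrels` maps each qid to a set of relevant docnos and `retrieved` maps each qid to a ranked list of docnos (names are assumptions, not the notebook's helpers):

def mrr_at_k_sketch(qids, retrieved, qrels, k=10):
    """Illustrative MRR@k: mean reciprocal rank of the first relevant doc in the top k."""
    reciprocal_ranks = []
    for qid in qids:
        relevant = qrels.get(qid, set())
        rr = 0.0
        for rank, docno in enumerate(retrieved.get(qid, [])[:k], start=1):
            if docno in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

# Example: the first relevant doc sits at rank 2, contributing 1/2 to the mean
print(mrr_at_k_sketch(["q1"], {"q1": ["d9", "d3"]}, {"q1": {"d3"}}, k=10))  # 0.5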

# Evaluate all pipelines
print("=" * 70)
print("STEP 5: Evaluating All Pipelines")
print("=" * 70)
 
# Get test query IDs
test_qids = query_ids[:num_test_queries]
 
# Calculate metrics for each pipeline
k_values = [1, 5, 10]
all_metrics = {}
 
print(f"\nCalculating metrics for {len(pipelines)} pipelines...\n")
 
for pipeline_name, results_df in results_dict.items():
    print(f"Evaluating: {pipeline_name}")
 
    # Prepare retrieved documents
    retrieved = prepare_retrieved_for_eval(results_df, test_qids)
 
    # Calculate metrics at different K
    metrics = {}
    for k in k_values:
        mrr = calculate_mrr_with_qrels(test_qids, retrieved, qrels, k)
        recall = calculate_recall_at_k_with_qrels(test_qids, retrieved, qrels, k)
        precision = calculate_precision_at_k_with_qrels(test_qids, retrieved, qrels, k)
 
        metrics[f"MRR@{k}"] = mrr
        metrics[f"Recall@{k}"] = recall
        metrics[f"Precision@{k}"] = precision
 
    all_metrics[pipeline_name] = metrics
    print("Metrics calculated\n")
 
print("=" * 70)
print("Evaluation complete!")
print("=" * 70)
 
 
 
# Display comprehensive comparison
print("=" * 70)
print("FINAL COMPARISON: All Pipelines")
print("=" * 70)
 
# Build comparison DataFrame
comparison_data = []
for pipeline_name, metrics in all_metrics.items():
    row = {"Pipeline": pipeline_name}
    row.update(metrics)
    comparison_data.append(row)
 
comparison_df = pd.DataFrame(comparison_data)
 
# Display table
print("\nMETRICS COMPARISON TABLE")
print("=" * 70)
print(comparison_df.to_string(index=False))
print("=" * 70)
 
# Highlight best performers
print("\nBEST PERFORMERS:")
print("-" * 70)
for metric in ["MRR@10", "Recall@10", "Precision@10"]:
    best_idx = comparison_df[metric].idxmax()
    best_pipeline = comparison_df.loc[best_idx, "Pipeline"]
    best_value = comparison_df.loc[best_idx, metric]
    print(f"{metric:15s}: {best_pipeline:25s} ({best_value:.4f})")
 
# Visualizations
fig, axes = plt.subplots(2, 3, figsize=(20, 12))
fig.suptitle("Complete Pipeline Comparison", fontsize=18, fontweight="bold", y=0.995)
 
pipeline_names = [
    p.split(". ")[1] if ". " in p else p for p in comparison_df["Pipeline"]
]
colors = ["steelblue", "coral", "lightgreen", "orange", "purple"]
 
# Plot each metric
metrics_to_plot = ["MRR@1", "MRR@5", "MRR@10", "Recall@10", "Precision@10"]
positions = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
 
for metric, pos in zip(metrics_to_plot, positions):
    ax = axes[pos]
    values = comparison_df[metric].values
 
    bars = ax.bar(
        range(len(values)), values, color=colors, alpha=0.8, edgecolor="black"
    )
 
    ax.set_ylabel(metric, fontsize=12, fontweight="bold")
    ax.set_title(f"{metric} Comparison", fontsize=13, fontweight="bold")
    ax.set_xticks(range(len(pipeline_names)))
    ax.set_xticklabels(pipeline_names, rotation=45, ha="right", fontsize=10)
    ax.grid(axis="y", alpha=0.3)
    ax.set_ylim(0, max(values) * 1.15)
 
    # Add value labels
    for bar in bars:
        height = bar.get_height()
        ax.text(
            bar.get_x() + bar.get_width() / 2.0,
            height,
            f"{height:.3f}",
            ha="center",
            va="bottom",
            fontsize=9,
            fontweight="bold",
        )
 
# Summary subplot
ax_summary = axes[1, 2]
ax_summary.axis("off")
 
plt.tight_layout()
plt.show()
======================================================================
STEP 5: Evaluating All Pipelines
======================================================================

Calculating metrics for 5 pipelines...

Evaluating: 1. Sparse (BM25)
Metrics calculated

Evaluating: 2. Dense (FAISS)
Metrics calculated

Evaluating: 3. Sparse + Rerank
Metrics calculated

Evaluating: 4. Dense + Rerank
Metrics calculated

Evaluating: 5. Hybrid + Rerank
Metrics calculated

======================================================================
Evaluation complete!
======================================================================
======================================================================
FINAL COMPARISON: All Pipelines
======================================================================

METRICS COMPARISON TABLE
======================================================================
          Pipeline  MRR@1  Recall@1  Precision@1    MRR@5  Recall@5  Precision@5   MRR@10  Recall@10  Precision@10
  1. Sparse (BM25)   0.25  0.236395     0.255102 0.366833  0.581633     0.132653 0.390631   0.760204      0.087755
  2. Dense (FAISS)   0.34  0.312925     0.346939 0.504000  0.806122     0.179592 0.523036   0.933673      0.106122
3. Sparse + Rerank   0.40  0.369048     0.408163 0.556333  0.801020     0.181633 0.563536   0.877551      0.101020
 4. Dense + Rerank   0.42  0.389456     0.428571 0.581667  0.841837     0.189796 0.592548   0.948980      0.108163
5. Hybrid + Rerank   0.40  0.369048     0.408163 0.556333  0.801020     0.181633 0.563536   0.877551      0.101020
======================================================================

BEST PERFORMERS:
----------------------------------------------------------------------
MRR@10         : 4. Dense + Rerank         (0.5925)
Recall@10      : 4. Dense + Rerank         (0.9490)
Precision@10   : 4. Dense + Rerank         (0.1082)
[Figure: Complete Pipeline Comparison — bar charts of MRR@1, MRR@5, MRR@10, Recall@10, and Precision@10 per pipeline]

The larger model improved the dense pipelines on all three metrics:

  • Base model:
MRR@10         : 4. Dense + Rerank         (0.5824)
Recall@10      : 4. Dense + Rerank         (0.9388)
Precision@10   : 4. Dense + Rerank         (0.1071)
  • Larger model:
MRR@10         : 4. Dense + Rerank         (0.5925)
Recall@10      : 4. Dense + Rerank         (0.9490)
Precision@10   : 4. Dense + Rerank         (0.1082)