This notebook explores modern Information Retrieval on the MS MARCO passage dataset: ranking passages with sparse (BM25) and dense (FAISS) retrieval, refining the results with neural reranking, and comparing the resulting pipelines in PyTerrier using standard evaluation metrics (MRR, Recall, and Precision at several cutoffs).
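As background, BM25 scores a passage d for a query q with the standard Okapi formulation (shown here for reference; the exact parameter defaults are whatever the PyTerrier/Terrier configuration uses):

\[
\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b \, \frac{|d|}{\mathrm{avgdl}}\right)}
\]

where f(t, d) is the frequency of term t in passage d, |d| is the passage length, avgdl is the average passage length in the collection, and k_1 and b are free parameters (commonly around 1.2 and 0.75).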
Environment Setup
Java Installation for PyTerrier
Download and install OpenJDK
Full OpenJDK 17 Setup (PACE / Linux)
# 1️⃣ Create ~/bin if it doesn't exist
cd ~
mkdir -p ~/bin
cd ~/bin

# 2️⃣ Download full OpenJDK 17 (Adoptium/Temurin)
wget https://github.com/adoptium/temurin17-binaries/releases/download/jdk-17.0.8+7/OpenJDK17U-jdk_x64_linux_hotspot_17.0.8_7.tar.gz

# 3️⃣ Extract
tar -xzf OpenJDK17U-jdk_x64_linux_hotspot_17.0.8_7.tar.gz
mv jdk-17.0.8+7 jdk-17

# 4️⃣ Set JAVA_HOME for this session
export JAVA_HOME=$HOME/bin/jdk-17
export PATH=$JAVA_HOME/bin:$PATH
export JVM_PATH=$JAVA_HOME/lib/server/libjvm.so
export LD_LIBRARY_PATH=$JAVA_HOME/lib/server:$LD_LIBRARY_PATH

# 5️⃣ Make it permanent (add to ~/.bashrc)
echo 'export JAVA_HOME=$HOME/bin/jdk-17' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
echo 'export JVM_PATH=$HOME/bin/jdk-17/lib/server/libjvm.so' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$HOME/bin/jdk-17/lib/server:$LD_LIBRARY_PATH' >> ~/.bashrc

# 6️⃣ Reload shell to apply changes
exec bash

# 7️⃣ Verify installation
java --version
javac --version
which java
which javac
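Once Java is installed and JAVA_HOME is set, a quick sanity check is to start the JVM from Python via PyTerrier. This is a minimal sketch and assumes pyterrier is already installed in the virtual environment created below:

# Sanity check: confirm PyTerrier can locate the JVM installed above (sketch;
# assumes `pyterrier` is installed in the active environment).
import os
import pyterrier as pt

print("JAVA_HOME =", os.environ.get("JAVA_HOME"))  # should point at ~/bin/jdk-17

if not pt.started():
    pt.init()  # starts the JVM; fails if libjvm.so cannot be found

print("PyTerrier started:", pt.started())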
# create a directory to store the environment
mkdir -p ~/scratch/dsgt-arc/ir

# create a virtual environment using uv
uv venv ~/scratch/dsgt-arc/ir/.venv
Create alias in ~/.bashrc
alias activate='source ~/scratch/dsgt-arc/ir/.venv/bin/activate'
Activate it
# Reload bashrc
source ~/.bashrc

# Activate the environment
activate

# verify
which python
Link the scratch .venv with the ir project .venv and Install dependencies using pyproject.toml
Link .venv and install dependencies
# navigate to project directory
cd ~/dsgt-arc/fall-2025-interest-group-projects/user/ctio3/

# Delete existing symlink
rm .venv

# create a symbolic link to your environment in this project folder
ln -s ~/scratch/dsgt-arc/ir/.venv $(pwd)/.venv

# verify
ls -l .venv

# Install project dependencies defined in pyproject.toml
cd ~/dsgt-arc/fall-2025-interest-group-projects/project/03-information-retrieval

# Check dependencies needed
cat pyproject.toml

# Install
uv pip install -e .
To delete the .venv
# leave the venv
deactivate

# delete it completely
rm -rf /storage/scratch1/1/ctio3/dsgt-arc/ir/.venv
Running the Jupyter Notebook
Register Jupyter kernel
python -m ipykernel install --user --name ir --display-name "DSGT ir (scratch)"

# Verify
jupyter kernelspec list
Use SSH port forwarding to expose the remote Jupyter server to the local browser
# One-liner to SSH into compute node, forward port 8889 to local machine, activate the `.venv`, and launch Jupyter
ssh -L 8889:localhost:8889 pace-interactive 'source ~/scratch/dsgt-arc/ir/.venv/bin/activate && jupyter lab --no-browser --port=8889'
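The Python cells that follow assume the notebook's earlier import/setup cell, which is not reproduced in this section. A minimal set of imports consistent with the names used below would be the following sketch (the device selection is an assumption):

# Imports assumed by the cells below (sketch, not the notebook's exact setup cell)
import random

import faiss
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer

# Assumption about how `device` was defined earlier in the notebook
device = "cuda" if torch.cuda.is_available() else "cpu"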
# Load sentence transformer model for embeddings
# This general-purpose model works well for semantic search and retrieval
model_name = "sentence-transformers/all-mpnet-base-v2"
print(f"Loading embedding model: {model_name}")
print("This may take a minute on first run...\n")

# Load model
embedding_model = SentenceTransformer(model_name, device=device)

print("Model loaded successfully!")
print(f"\n{'=' * 60}")
print("MODEL INFORMATION")
print(f"{'=' * 60}")

# Model info
print(f"Model: {model_name}")
print(f"Embedding Dimension: {embedding_model.get_sentence_embedding_dimension()}")
print(f"Max Sequence Length: {embedding_model.max_seq_length} tokens")
print(f"Device: {embedding_model.device}")

# Count parameters
total_params = sum(p.numel() for p in embedding_model.parameters())
print(f"Total Parameters: {total_params:,}")
print(f"Model Size: ~{total_params * 4 / (1024**2):.1f} MB")
print(f"{'=' * 60}\n")
============================================================
MODEL INFORMATION
============================================================
Model: sentence-transformers/all-mpnet-base-v2
Embedding Dimension: 768
Max Sequence Length: 384 tokens
Device: cuda:0
Total Parameters: 109,486,464
Model Size: ~417.7 MB
============================================================
# Test the embedding model with sample texts
test_query = "how do airplanes fly"
test_passages = [
    "Airplanes fly because of lift generated by air flowing over wings.",
    "The principles of aerodynamics explain how aircraft achieve flight.",
    "Pizza is a popular Italian food with cheese and tomato sauce.",
    "Python is a programming language used for data science.",
]
print(f"Test Query: '{test_query}'\n")

# Encode
query_emb = embedding_model.encode([test_query], convert_to_tensor=True)
passage_embs = embedding_model.encode(test_passages, convert_to_tensor=True)

# Compute similarities
similarities = torch.nn.functional.cosine_similarity(query_emb, passage_embs)

print("Similarity Scores (higher = more relevant):\n")
for i, (passage, score) in enumerate(
    sorted(zip(test_passages, similarities), key=lambda x: x[1], reverse=True), 1
):
    bar = "█" * int(score * 50)
    print(f"{i}. Score: {score:.4f} {bar}")
    print(f"   {passage}\n")

print(
    "Notice: Semantically related passages get higher scores, even without word overlap!"
)
Test Query: 'how do airplanes fly'
Similarity Scores (higher = more relevant):
1. Score: 0.8020 ████████████████████████████████████████
Airplanes fly because of lift generated by air flowing over wings.
2. Score: 0.7006 ███████████████████████████████████
The principles of aerodynamics explain how aircraft achieve flight.
3. Score: 0.0608 ███
Python is a programming language used for data science.
4. Score: 0.0376 █
Pizza is a popular Italian food with cheese and tomato sauce.
Notice: Semantically related passages get higher scores, even without word overlap!
# Encode all passages into embeddings
# This is the computationally expensive part: it takes about a minute on GPU and much longer on CPU
print(f"Encoding {len(passages):,} passages...")

batch_size = 32  # Adjust based on your GPU memory

# Create passage embeddings
passage_embeddings = embedding_model.encode(
    passages,
    batch_size=batch_size,
    show_progress_bar=True,
    convert_to_tensor=False,  # Return numpy for FAISS
    normalize_embeddings=True,  # Normalize for cosine similarity
)

print("\nPassage embeddings created!")
print(f"Shape: {passage_embeddings.shape}")
print("Reuilding FAISS index...")# Get embedding dimensionembedding_dim = passage_embeddings.shape[1]# Create FAISS index# IndexFlatIP = Flat index with Inner Product (cosine similarity for normalized vectors)index = faiss.IndexFlatIP(embedding_dim)# Add passage embeddings to indexindex.add(passage_embeddings.astype("float32"))print("FAISS index built!")print(f"\n{'=' * 60}")print("INDEX INFORMATION")print(f"{'=' * 60}")print("Index Type: Flat (Exact Search)")print("Similarity Metric: Inner Product (Cosine for normalized vectors)")print(f"Embedding Dimension: {embedding_dim}")print(f"Number of Vectors: {index.ntotal:,}")print(f"Index Size: ~{passage_embeddings.nbytes / (1024**2):.1f} MB")print(f"Is Trained: {index.is_trained}")print(f"{'=' * 60}\n")print("The index is now ready for lightning-fast retrieval!")
Building FAISS index...
FAISS index built!
============================================================
INDEX INFORMATION
============================================================
Index Type: Flat (Exact Search)
Similarity Metric: Inner Product (Cosine for normalized vectors)
Embedding Dimension: 768
Number of Vectors: 82,193
Index Size: ~240.8 MB
Is Trained: True
============================================================
The index is now ready for lightning-fast retrieval!
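The scores and indices arrays used in the next cell come from a batched FAISS search that is not shown in this section. A sketch of that step, assuming all queries are searched at once with an assumed top_k of 10, would be:

# Sketch of the batched search that presumably produced `scores` and `indices`
# (top_k and batch_size are assumptions).
top_k = 10

# Encode queries the same way as the passages (normalised so that inner product = cosine)
query_embeddings = embedding_model.encode(
    queries,
    batch_size=32,
    show_progress_bar=True,
    normalize_embeddings=True,
)

# scores: (num_queries, top_k) similarities; indices: (num_queries, top_k) rows into `passages`
scores, indices = index.search(query_embeddings.astype("float32"), top_k)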
# Show sample retrieval results
num_examples = 3
sample_query_indices = random.sample(range(len(queries)), num_examples)

for example_num, qidx in enumerate(sample_query_indices, 1):
    print("=" * 80)
    print(f"EXAMPLE {example_num}")
    print("=" * 80)
    print(f"\nQuery: {queries[qidx]}\n")
    print("Top-5 Retrieved Passages:\n")

    for rank, (score, pidx) in enumerate(zip(scores[qidx][:5], indices[qidx][:5]), 1):
        passage_text = passages[pidx]
        # Truncate long passages
        if len(passage_text) > 150:
            passage_text = passage_text[:150] + "..."
        print(f"  Rank {rank} | Score: {score:.4f}")
        print(f"  {passage_text}\n")
================================================================================
EXAMPLE 1
================================================================================
Query: what kind of organism is a black damsel
Top-5 Retrieved Passages:
Rank 1 | Score: 0.6407
Size and appearance. The largest scientifically measured Striped Damsel was 10.0 cm / 3.9 in. The Striped Damsel is white and adorned with three black...
Rank 2 | Score: 0.5679
Damselfishes comprise the family Pomacentridae except those of the genera Amphiprion and Premnas, which are the anemonefishes. They can grow up to 14 ...
Rank 3 | Score: 0.5603
Known by multiple common names, such as humbug damselfish, three-striped damselfish and white-tailed damselfish, Dascyllus aruanus is a feisty little ...
Rank 4 | Score: 0.5486
The 3-Stripe Damselfish, also known as the Three Striped Damselfish, White-tailed Damselfish, and Humbug Dascyllus, is a popular fish. Three bold blac...
Rank 5 | Score: 0.5313
1 Yellowtail Damsel Fish-The most popular damsel fish available, these naturally resilient fish are comparatively peaceful if kept in a shoal in a tan...
================================================================================
EXAMPLE 2
================================================================================
...
# Analyze score distributions
all_scores = scores.flatten()

fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Score distribution
axes[0].hist(all_scores, bins=50, color="steelblue", alpha=0.7, edgecolor="black")
axes[0].set_xlabel("Similarity Score", fontsize=12)
axes[0].set_ylabel("Frequency", fontsize=12)
axes[0].set_title("Distribution of Retrieval Scores", fontsize=14, fontweight="bold")
axes[0].axvline(
    np.mean(all_scores),
    color="red",
    linestyle="--",
    linewidth=2,
    label=f"Mean: {np.mean(all_scores):.3f}",
)
axes[0].legend()
axes[0].grid(axis="y", alpha=0.3)

# Top-1 scores vs rank
top_scores = scores[:, 0]  # Best score for each query
axes[1].hist(top_scores, bins=50, color="coral", alpha=0.7, edgecolor="black")
axes[1].set_xlabel("Top-1 Similarity Score", fontsize=12)
axes[1].set_ylabel("Frequency", fontsize=12)
axes[1].set_title("Distribution of Best Match Scores", fontsize=14, fontweight="bold")
axes[1].axvline(
    np.mean(top_scores),
    color="red",
    linestyle="--",
    linewidth=2,
    label=f"Mean: {np.mean(top_scores):.3f}",
)
axes[1].legend()
axes[1].grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.show()

print("\nRetrieval Statistics:")
print(f"Average similarity score: {np.mean(all_scores):.4f}")
print(f"Average top-1 score: {np.mean(top_scores):.4f}")
print(f"Min score: {np.min(all_scores):.4f}")
print(f"Max score: {np.max(all_scores):.4f}")
Retrieval Statistics:
Average similarity score: 0.4298
Average top-1 score: 0.7800
Min score: 0.2015
Max score: 0.9398
Recreate FAISS retriever and re-run pipeline definition
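The cell below relies on a custom FAISSRetriever wrapper defined earlier in the notebook. For reference, a minimal sketch of such a PyTerrier transformer is shown here; the notebook's actual implementation may differ:

# Minimal sketch of a FAISS-backed PyTerrier transformer (assumed interface;
# the notebook's real FAISSRetriever may differ).
import pandas as pd
import pyterrier as pt


class FAISSRetriever(pt.Transformer):
    def __init__(self, index, passages, passage_ids, embedding_model, k=100):
        self.index = index
        self.passages = passages
        self.passage_ids = passage_ids
        self.embedding_model = embedding_model
        self.k = k

    def transform(self, queries_df: pd.DataFrame) -> pd.DataFrame:
        # Encode all queries, search the FAISS index, and emit the
        # (qid, query, docno, score, rank) columns PyTerrier pipelines expect.
        query_emb = self.embedding_model.encode(
            queries_df["query"].tolist(), normalize_embeddings=True
        )
        scores, indices = self.index.search(query_emb.astype("float32"), self.k)

        rows = []
        for (_, q), q_scores, q_indices in zip(queries_df.iterrows(), scores, indices):
            for rank, (score, pidx) in enumerate(zip(q_scores, q_indices)):
                rows.append(
                    {
                        "qid": q["qid"],
                        "query": q["query"],
                        "docno": self.passage_ids[pidx],
                        "score": float(score),
                        "rank": rank,
                    }
                )
        return pd.DataFrame(rows)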
# Recreate FAISS retriever
faiss_pt = FAISSRetriever(
    index=index,
    passages=passages,
    passage_ids=passage_ids,
    embedding_model=embedding_model,
    k=100,
)
print("\nFAISS PyTerrier wrapper recreated!")
print("Will retrieve top-100 results per query")
print("Can now be used in PyTerrier pipelines")

# Rebuild retrieval pipelines
print("=" * 70)
print("Updating Retrieval Pipelines")
print("=" * 70)

pipeline_sparse = bm25 % 100
pipeline_dense = faiss_pt
pipeline_sparse_rerank = (bm25 % 100) >> pt_reranker
pipeline_dense_rerank = faiss_pt >> pt_reranker
pipeline_hybrid_rerank = (
    ((bm25 % 100) >> score_normalizer) + (faiss_pt >> score_normalizer)
) >> pt_reranker

pipelines = {
    "1. Sparse (BM25)": pipeline_sparse,
    "2. Dense (FAISS)": pipeline_dense,
    "3. Sparse + Rerank": pipeline_sparse_rerank,
    "4. Dense + Rerank": pipeline_dense_rerank,
    "5. Hybrid + Rerank": pipeline_hybrid_rerank,
}
print("\nAll pipelines Rebuilt!\n")

# Run all pipelines and collect results
print("=" * 70)
print("Running All Pipelines")
print("=" * 70)
print(f"\nTesting on {len(test_queries_df)} queries\n")

# Run each pipeline and store results
results_dict = {}
for name, pipeline in pipelines.items():
    print(f"Running: {name}")
    # Transform queries through pipeline
    results = pipeline.transform(test_queries_df)
    results_dict[name] = results
    print("Completed!")

print("=" * 70)
print("All pipelines executed successfully!")
print("=" * 70)
FAISS PyTerrier wrapper recreated!
Will retrieve top-100 results per query
Can now be used in PyTerrier pipelines
======================================================================
Updating Retrieval Pipelines
======================================================================
All pipelines Rebuilt!
======================================================================
Running All Pipelines
======================================================================
Testing on 100 queries
Running: 1. Sparse (BM25)
Completed!
Running: 2. Dense (FAISS)
Completed!
Running: 3. Sparse + Rerank
Completed!
Running: 4. Dense + Rerank
Completed!
Running: 5. Hybrid + Rerank
Completed!
======================================================================
All pipelines executed successfully!
======================================================================
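The hybrid pipeline above also depends on a score_normalizer transformer that is defined earlier in the notebook. A common choice is per-query min-max normalisation, so that BM25 and FAISS scores can be added on a comparable scale; a rough sketch (the notebook's actual normaliser may differ):

# Sketch of a per-query min-max score normaliser (assumed implementation).
import pyterrier as pt


def _minmax_per_query(results):
    def _norm(group):
        lo, hi = group["score"].min(), group["score"].max()
        group["score"] = (group["score"] - lo) / (hi - lo) if hi > lo else 0.0
        return group

    return results.groupby("qid", group_keys=False).apply(_norm)


score_normalizer = pt.apply.generic(_minmax_per_query)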
Rerun evaluation
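The evaluation cell below calls helper functions (prepare_retrieved_for_eval, calculate_mrr_with_qrels, calculate_recall_at_k_with_qrels, calculate_precision_at_k_with_qrels) defined earlier in the notebook. As an illustration of what they compute, a minimal MRR@k helper over a qrels mapping might look like this sketch (assumed signatures and data shapes):

# Sketch of an MRR@k helper over qrels of the form {qid: {docno: relevance}}
# and retrieved rankings of the form {qid: [docno, ...]} (assumed shapes).
def calculate_mrr_with_qrels(qids, retrieved, qrels, k):
    reciprocal_ranks = []
    for qid in qids:
        relevant = {d for d, rel in qrels.get(qid, {}).items() if rel > 0}
        rr = 0.0
        for rank, docno in enumerate(retrieved.get(qid, [])[:k], start=1):
            if docno in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0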
# Evaluate all pipelines
print("=" * 70)
print("STEP 5: Evaluating All Pipelines")
print("=" * 70)

# Get test query IDs
test_qids = query_ids[:num_test_queries]

# Calculate metrics for each pipeline
k_values = [1, 5, 10]
all_metrics = {}

print(f"\nCalculating metrics for {len(pipelines)} pipelines...\n")

for pipeline_name, results_df in results_dict.items():
    print(f"Evaluating: {pipeline_name}")

    # Prepare retrieved documents
    retrieved = prepare_retrieved_for_eval(results_df, test_qids)

    # Calculate metrics at different K
    metrics = {}
    for k in k_values:
        mrr = calculate_mrr_with_qrels(test_qids, retrieved, qrels, k)
        recall = calculate_recall_at_k_with_qrels(test_qids, retrieved, qrels, k)
        precision = calculate_precision_at_k_with_qrels(test_qids, retrieved, qrels, k)
        metrics[f"MRR@{k}"] = mrr
        metrics[f"Recall@{k}"] = recall
        metrics[f"Precision@{k}"] = precision

    all_metrics[pipeline_name] = metrics
    print("Metrics calculated\n")

print("=" * 70)
print("Evaluation complete!")
print("=" * 70)

# Display comprehensive comparison
print("=" * 70)
print("FINAL COMPARISON: All Pipelines")
print("=" * 70)

# Build comparison DataFrame
comparison_data = []
for pipeline_name, metrics in all_metrics.items():
    row = {"Pipeline": pipeline_name}
    row.update(metrics)
    comparison_data.append(row)

comparison_df = pd.DataFrame(comparison_data)

# Display table
print("\nMETRICS COMPARISON TABLE")
print("=" * 70)
print(comparison_df.to_string(index=False))
print("=" * 70)

# Highlight best performers
print("\nBEST PERFORMERS:")
print("-" * 70)
for metric in ["MRR@10", "Recall@10", "Precision@10"]:
    best_idx = comparison_df[metric].idxmax()
    best_pipeline = comparison_df.loc[best_idx, "Pipeline"]
    best_value = comparison_df.loc[best_idx, metric]
    print(f"{metric:15s}: {best_pipeline:25s} ({best_value:.4f})")

# Visualizations
fig, axes = plt.subplots(2, 3, figsize=(20, 12))
fig.suptitle("Complete Pipeline Comparison", fontsize=18, fontweight="bold", y=0.995)

pipeline_names = [
    p.split(". ")[1] if ". " in p else p for p in comparison_df["Pipeline"]
]
colors = ["steelblue", "coral", "lightgreen", "orange", "purple"]

# Plot each metric
metrics_to_plot = ["MRR@1", "MRR@5", "MRR@10", "Recall@10", "Precision@10"]
positions = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]

for metric, pos in zip(metrics_to_plot, positions):
    ax = axes[pos]
    values = comparison_df[metric].values
    bars = ax.bar(
        range(len(values)), values, color=colors, alpha=0.8, edgecolor="black"
    )
    ax.set_ylabel(metric, fontsize=12, fontweight="bold")
    ax.set_title(f"{metric} Comparison", fontsize=13, fontweight="bold")
    ax.set_xticks(range(len(pipeline_names)))
    ax.set_xticklabels(pipeline_names, rotation=45, ha="right", fontsize=10)
    ax.grid(axis="y", alpha=0.3)
    ax.set_ylim(0, max(values) * 1.15)

    # Add value labels
    for bar in bars:
        height = bar.get_height()
        ax.text(
            bar.get_x() + bar.get_width() / 2.0,
            height,
            f"{height:.3f}",
            ha="center",
            va="bottom",
            fontsize=9,
            fontweight="bold",
        )

# Summary subplot
ax_summary = axes[1, 2]
ax_summary.axis("off")

plt.tight_layout()
plt.show()