Intro

Embeddings

We will use DINOv2, a vision transformer model, to extract image embeddings from the Fashion-MNIST dataset. These embeddings will be used to visualize image similarity via clustering and to perform similarity search using k-nearest neighbors (KNN).


Environment Setup

Temp Directory

  • Set $TMPDIR (defaulting to /tmp if it is unset); the virtual environment will be created on this fast, temporary file system
echo '# Set a temporary directory' >> ~/.bashrc
echo 'export TMPDIR=${TMPDIR:-/tmp}' >> ~/.bashrc
source ~/.bashrc
echo $TMPDIR
  • Create the virtual environment in $TMPDIR
uv venv $TMPDIR/.venv
  • Activate the environment
source $TMPDIR/.venv/bin/activate
  • Create an alias for activation
echo "alias activate='source \$TMPDIR/.venv/bin/activate'" >> ~/.bashrc
source ~/.bashrc
  • Navigate to your user/ directory, link the environment, and install dependencies
# First, cd into your user folder
 
# Create a symbolic link named .venv in your user folder,
# pointing to the actual environment in $TMPDIR
ln -s "$TMPDIR/.venv" "$(pwd)/.venv"
 
# Install project dependencies defined in pyproject.toml
cd ~/dsgt-arc/fall-2025-interest-group-projects/project/02-embeddings
uv pip install -e .
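
Running the echo commands above more than once appends duplicate lines to ~/.bashrc. A minimal sketch of one way to avoid that, assuming a hypothetical helper named append_once (it is not part of the project or of uv):

```shell
# append_once is a hypothetical helper, not part of the project.
# It appends a line to a file only if that exact line is not already there.
append_once() {
    local line="$1" file="$2"
    # Make sure the file exists so grep does not fail on a missing path
    touch "$file"
    # -q quiet, -x match the whole line, -F treat the line as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example, append_once 'export TMPDIR=${TMPDIR:-/tmp}' ~/.bashrc can be run repeatedly and leaves a single export line in the file.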

Run the Jupyter Notebook

  • Activate the .venv
activate
  • Install ipykernel
uv pip install ipykernel
  • Register the kernel
python -m ipykernel install --user \
    --name dsgt-tmp \
    --display-name "DSGT test ($TMPDIR)"
  • Open a copy of embedding.ipynb in Jupyter and select the "DSGT test" kernel registered in the previous step
jupyter notebook

Set up the virtual environment in scratch

  • Create a .venv in scratch
# create a directory to store the environment
mkdir -p ~/scratch/dsgt-arc/embeddings
 
# create a virtual environment using uv
uv venv ~/scratch/dsgt-arc/embeddings/.venv
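  • Add an alias to ~/.bashrc, then reload it with source ~/.bashrc (this reuses the name activate from the $TMPDIR setup, so keep only one of the two)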
alias activate='source ~/scratch/dsgt-arc/embeddings/.venv/bin/activate'
  • Activate it
activate
# verify: this should print the python inside ~/scratch/dsgt-arc/embeddings/.venv/bin
which python
  • Link the scratch .venv to the project's .venv and install dependencies using pyproject.toml
  • Register Jupyter kernel
python -m ipykernel install --user \
  --name embeddings \
  --display-name "DSGT Embeddings (scratch)"
  • Connect to the pace-interactive host and start JupyterLab
ssh pace-interactive
jupyter lab --no-browser --ip=0.0.0.0 --port=8888
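
The step that links the scratch .venv into the project has no command listed. A minimal sketch of what it might look like, assuming a hypothetical helper named link_venv and the same ln -s approach used in the $TMPDIR setup:

```shell
# link_venv is a hypothetical helper: it points <project_dir>/.venv at a
# virtual environment stored elsewhere (for example in scratch).
link_venv() {
    local venv_dir="$1" project_dir="$2"
    if [ ! -d "$venv_dir" ]; then
        echo "no virtual environment at $venv_dir" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing link,
    # -n: do not follow an existing .venv link into the old environment
    ln -sfn "$venv_dir" "$project_dir/.venv"
}
```

A possible usage, with paths taken from the steps above: link_venv ~/scratch/dsgt-arc/embeddings/.venv ~/dsgt-arc/fall-2025-interest-group-projects/project/02-embeddings, followed by cd into that project directory and uv pip install -e . with the environment activated.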

Create a Jupyter start script

  • Create a start-pace-jupyter.sh script
  • Make it executable
chmod +x start-pace-jupyter.sh
  • Run the script
./start-pace-jupyter.sh
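
The contents of start-pace-jupyter.sh are not shown here. A minimal sketch of one possible version, assuming the script runs on the PACE node after ssh pace-interactive, that the environment lives in the scratch location used earlier, and that the VENV_DIR and PORT variables (both hypothetical names) are acceptable overrides:

```shell
# Write a hypothetical start-pace-jupyter.sh into the current directory.
# The quoted 'EOF' keeps $VENV_DIR and $PORT unexpanded until the script runs.
cat > start-pace-jupyter.sh <<'EOF'
#!/usr/bin/env bash
# Stop at the first failing command
set -e

# Assumed defaults; override by exporting VENV_DIR or PORT beforehand
VENV_DIR="${VENV_DIR:-$HOME/scratch/dsgt-arc/embeddings/.venv}"
PORT="${PORT:-8888}"

# Activate the scratch environment, then replace this shell with JupyterLab
source "$VENV_DIR/bin/activate"
exec jupyter lab --no-browser --ip=0.0.0.0 --port="$PORT"
EOF
```

The jupyter lab flags are the ones used in the scratch setup; a different port can be chosen with, for example, PORT=8890 ./start-pace-jupyter.sh.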

Next

03-information-retrieval