Intro

  • Check that a GPU is available and idle
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 6000                On  |   00000000:AF:00.0 Off |                  Off |
| 34%   30C    P8             16W /  260W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
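  • For scripting, the same check is available in machine-readable form via standard nvidia-smi query flags
nvidia-smi --query-gpu=name,memory.total,utilization.gpu --format=csv,noheader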

Ollama

Ollama is an open-source platform for running large language models such as gpt-oss, Gemma 3, DeepSeek-R1, and Qwen3 locally. It lets users generate text, assist with coding, and create content privately and securely on their own devices.

Prereqs

  • Add the following environment variables to ~/.bashrc so that dependencies are cached to scratch instead of your home directory
# to make pip-installable tools available
export PATH=$HOME/.local/bin:$PATH
# to redirect uv and huggingface cache to scratch
export XDG_CACHE_HOME="$HOME/scratch/.cache"
# to redirect ollama models to scratch
export OLLAMA_MODELS=$HOME/scratch/ollama_models
  • Reload shell configuration
source ~/.bashrc
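  • Optionally confirm the variables took effect in the new shell and that the target directories exist (creating them up front is harmless)
echo "OLLAMA_MODELS=$OLLAMA_MODELS"
echo "XDG_CACHE_HOME=$XDG_CACHE_HOME"
mkdir -p "$OLLAMA_MODELS" "$XDG_CACHE_HOME"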

Running Ollama

  • Set up PACE interactive session with GPU support
salloc \
    --account=paceship-dsgt_clef2026 \
    --gres=gpu:1 \
    --constraint=RTX6000 \
    --qos=embers \
    --time=1:00:00
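  • Once the allocation is granted, a quick sanity check that the job landed on a GPU node; if salloc left you on the login node, srun runs the command inside the allocation
srun hostname
srun nvidia-smi -L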
  • Check which Ollama versions are available as modules
module spider ollama
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  ollama:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        ollama/0.5.1
        ollama/0.6.6
        ollama/0.9.0
 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  For detailed information about a specific "ollama" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  • Load the ollama module and start the server
# in the serving terminal window
module load ollama
ollama serve
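  • By default the server listens on 127.0.0.1:11434; if that port is already in use on a shared node, the bind address can be overridden with OLLAMA_HOST (the client respects the same variable; the port below is an arbitrary example)
OLLAMA_HOST=127.0.0.1:11500 ollama serve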
  • In another terminal, run the client to download and run a Gemma 3 model
# in the client terminal window
module load ollama
ollama run gemma3:4b
pulling manifest 
pulling aeda25e63ebd: 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– 3.3 GB                         
pulling e0a42594d802: 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  358 B                         
pulling dd084c7d92a3: 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– 8.4 KB                         
pulling 3116c5225075: 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   77 B                         
pulling b6ae5839783f: 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  489 B                         
verifying sha256 digest 
writing manifest 
success 
>>> Send a message (/? for help)
  • You can now type messages, for example:
>>> write a haiku about georgia tech
Yellow jackets soar,
Tech’s spirit, a brilliant hue,
Future leaders rise.
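  • To leave the interactive session, type /bye (or /? to list the available commands)
>>> /bye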
  • Verify the server is running
curl http://localhost:11434/api/tags | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   329  100   329    0     0   107k      0 --:--:-- --:--:-- --:--:--  107k
{
  "models": [
    {
      "name": "gemma3:4b",
      "model": "gemma3:4b",
      "modified_at": "2025-12-21T00:08:40-05:00",
      "size": 3338801804,
      "digest": "a2af6cc3eb7fa8be8504abaf9b04e88f17a119ec3f04a3addf55f92841195f5a",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "gemma3",
        "families": [
          "gemma3"
        ],
        "parameter_size": "4.3B",
        "quantization_level": "Q4_K_M"
      }
    }
  ]
}
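  • The server can also be queried non-interactively through its HTTP API; a minimal sketch against the /api/generate endpoint (the prompt is just an example)
curl http://localhost:11434/api/generate \
    -d '{"model": "gemma3:4b", "prompt": "Why is the sky blue?", "stream": false}' | jq -r .response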
  • Verify the model was cached to scratch
tree ~/scratch/ollama_models/
/storage/home/hcoda1/1/ctio3/scratch/ollama_models/
β”œβ”€β”€ blobs
β”‚   β”œβ”€β”€ sha256-3116c52250752e00dd06b16382e952bd33c34fd79fc4fe3a5d2c77cf7de1b14b
β”‚   β”œβ”€β”€ sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25
β”‚   β”œβ”€β”€ sha256-b6ae5839783f2ba248e65e4b960ab15f9c4b7118db285827dba6cba9754759e2
β”‚   β”œβ”€β”€ sha256-dd084c7d92a3c1c14cc09ae77153b903fd2024b64a100a0cc8ec9316063d2dbc
β”‚   └── sha256-e0a42594d802e5d31cdc786deb4823edb8adff66094d49de8fffe976d753e348
└── manifests
    └── registry.ollama.ai
        └── library
            └── gemma3
                └── 4b

5 directories, 6 files
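  • Scratch space is still finite, so cached models can be listed and removed with the client once they are no longer needed
ollama list
ollama rm gemma3:4b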

Benchmark Ollama

  • Install a pre-built benchmarking tool for Ollama
uv tool install git+https://github.com/dsgt-arc/ollama-benchmark.git
  • Run the benchmark
ollama-benchmark --verbose --models gemma3:4b --prompts "who is george p burdell?"
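  • For a quick single-prompt measurement without the benchmark tool, the client itself can print timing statistics; a minimal sketch
ollama run gemma3:4b --verbose "who is george p burdell?"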

Ollama script

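  • The repository's ollama.sbatch is not reproduced here; the sketch below shows the general shape such a script might take (directives are copied from the salloc example above, and the log path and prompt are assumptions, not the actual script)
#!/bin/bash
#SBATCH --job-name=ollama
#SBATCH --account=paceship-dsgt_clef2026
#SBATCH --gres=gpu:1
#SBATCH --constraint=RTX6000
#SBATCH --qos=embers
#SBATCH --time=1:00:00
#SBATCH --output=logs/ollama/%j.out

# start the server in the background, give it a moment to come up,
# then run a single prompt against it
module load ollama
ollama serve &
sleep 10
ollama run gemma3:4b "who is george p burdell?"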
  • Submit the batch script
sbatch ollama.sbatch
Submitted batch job 3016564
  • Check log output in logs/ollama
  • Verify the job is running
squeue -u ctio3
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
3016578 gpu-rtx60   ollama    ctio3  R       0:00      1 atl1-1-02-005-31-0
3016056 gpu-rtx60 sys/dash    ctio3  R      58:55      1 atl1-1-03-003-19-0

vLLM

  • Benchmark a model with vLLM
    • More specifically: run a throughput benchmark on Qwen3-0.6B, measuring how many tokens per second it generates with 32-token inputs, a single output token, and a maximum context length of 1024 tokens
uvx vllm bench throughput \
    --model Qwen/Qwen3-0.6B \
    --input-len 32 \
    --output-len 1 \
    --max-model-len 1024
  • Serve the model (starts a local OpenAI-compatible inference server)
MODEL="Qwen/Qwen3-0.6B"
uvx vllm serve $MODEL
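  • The server listens on port 8000 by default, which is what the request in the next step assumes; a different port can be chosen with --port if 8000 is busy (the value below is just an example)
uvx vllm serve $MODEL --port 8080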
  • In another terminal, send a completion request to the model and display the response
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"$MODEL\",\"prompt\":\"Hi\"}" | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   507  100   481  100    26    600     32 --:--:-- --:--:-- --:--:--   632
{
  "id": "cmpl-b214bbf7385900cc",
  "object": "text_completion",
  "created": 1766302389,
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "text": "Question\n\nThe given function is:\n\nf(x) = 3x^3",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 1,
    "total_tokens": 17,
    "completion_tokens": 16,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
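  • The server also exposes the OpenAI-compatible chat endpoint; a minimal sketch (the message content is just an example)
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"Say hi in one word\"}]}" | jq '.choices[0].message.content'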

Exercises

Next