Intro

  • Check that a GPU is available and idle
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 6000                On  |   00000000:AF:00.0 Off |                  Off |
| 34%   30C    P8             16W /  260W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
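  • For scripting, the same check is available in machine-readable form via standard nvidia-smi query flags
nvidia-smi --query-gpu=name,memory.total,utilization.gpu --format=csv,noheader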

Ollama

Ollama is an open-source platform for running large language models such as gpt-oss, Gemma 3, DeepSeek-R1, and Qwen3 locally. It lets users generate text, assist with coding, and create content privately and securely on their own devices.

Prereqs

  • Add the following environment variables to ~/.bashrc so that dependencies are cached to scratch instead of your home directory
# to make pip-installable tools available
export PATH=$HOME/.local/bin:$PATH
# to redirect uv and huggingface cache to scratch
export XDG_CACHE_HOME="$HOME/scratch/.cache"
# to redirect ollama models to scratch
export OLLAMA_MODELS=$HOME/scratch/ollama_models
  • Reload shell configuration
source ~/.bashrc
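  • Optionally confirm the variables took effect in the new shell and that the target directories exist (creating them up front is harmless)
echo "OLLAMA_MODELS=$OLLAMA_MODELS"
echo "XDG_CACHE_HOME=$XDG_CACHE_HOME"
mkdir -p "$OLLAMA_MODELS" "$XDG_CACHE_HOME"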

Running Ollama

  • Set up PACE interactive session with GPU support
salloc \
    --account=paceship-dsgt_clef2026 \
    --gres=gpu:1 \
    --constraint=RTX6000 \
    --qos=embers \
    --time=1:00:00
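  • Once the allocation is granted, a quick sanity check that the job landed on a GPU node; if salloc left you on the login node, srun runs the command inside the allocation
srun hostname
srun nvidia-smi -L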
  • Check which Ollama versions are available as modules
module spider ollama
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  ollama:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        ollama/0.5.1
        ollama/0.6.6
        ollama/0.9.0
 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  For detailed information about a specific "ollama" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  • Load the ollama module and start the server
# in the serving terminal window
module load ollama
ollama serve
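  • By default the server listens on 127.0.0.1:11434; if that port is already in use on a shared node, the bind address can be overridden with OLLAMA_HOST (the client respects the same variable; the port below is an arbitrary example)
OLLAMA_HOST=127.0.0.1:11500 ollama serve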
  • In another terminal, run the client to download and run a Gemma 3 model
# in the client terminal window
module load ollama
ollama run gemma3:4b
pulling manifest 
pulling aeda25e63ebd: 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– 3.3 GB                         
pulling e0a42594d802: 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  358 B                         
pulling dd084c7d92a3: 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– 8.4 KB                         
pulling 3116c5225075: 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   77 B                         
pulling b6ae5839783f: 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  489 B                         
verifying sha256 digest 
writing manifest 
success 
>>> Send a message (/? for help)
  • You can now type messages, for example:
>>> write a haiku about georgia tech
Yellow jackets soar,
Tech’s spirit, a brilliant hue,
Future leaders rise.
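  • To leave the interactive session, type /bye (or /? to list the available commands)
>>> /bye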
  • Verify the server is running
curl http://localhost:11434/api/tags | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   329  100   329    0     0   107k      0 --:--:-- --:--:-- --:--:--  107k
{
  "models": [
    {
      "name": "gemma3:4b",
      "model": "gemma3:4b",
      "modified_at": "2025-12-21T00:08:40-05:00",
      "size": 3338801804,
      "digest": "a2af6cc3eb7fa8be8504abaf9b04e88f17a119ec3f04a3addf55f92841195f5a",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "gemma3",
        "families": [
          "gemma3"
        ],
        "parameter_size": "4.3B",
        "quantization_level": "Q4_K_M"
      }
    }
  ]
}
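  • The server can also be queried non-interactively through its HTTP API; a minimal sketch against the /api/generate endpoint (the prompt is just an example)
curl http://localhost:11434/api/generate \
    -d '{"model": "gemma3:4b", "prompt": "Why is the sky blue?", "stream": false}' | jq -r .response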
  • Verify the model was cached to scratch
tree ~/scratch/ollama_models/
/storage/home/hcoda1/1/ctio3/scratch/ollama_models/
β”œβ”€β”€ blobs
β”‚   β”œβ”€β”€ sha256-3116c52250752e00dd06b16382e952bd33c34fd79fc4fe3a5d2c77cf7de1b14b
β”‚   β”œβ”€β”€ sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25
β”‚   β”œβ”€β”€ sha256-b6ae5839783f2ba248e65e4b960ab15f9c4b7118db285827dba6cba9754759e2
β”‚   β”œβ”€β”€ sha256-dd084c7d92a3c1c14cc09ae77153b903fd2024b64a100a0cc8ec9316063d2dbc
β”‚   └── sha256-e0a42594d802e5d31cdc786deb4823edb8adff66094d49de8fffe976d753e348
└── manifests
    └── registry.ollama.ai
        └── library
            └── gemma3
                └── 4b

5 directories, 6 files
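  • Scratch space is still finite, so cached models can be listed and removed with the client once they are no longer needed
ollama list
ollama rm gemma3:4b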

Benchmark Ollama

  • Install a pre-built benchmarking tool for Ollama
uv tool install git+https://github.com/dsgt-arc/ollama-benchmark.git
  • Run the benchmark
ollama-benchmark --verbose --models gemma3:4b --prompts "who is george p burdell?"
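  • For a quick single-prompt measurement without the benchmark tool, the client itself can print timing statistics; a minimal sketch
ollama run gemma3:4b --verbose "who is george p burdell?"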

Ollama script

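  • The repository's ollama.sbatch is not reproduced here; the sketch below shows the general shape such a script might take (directives are copied from the salloc example above, and the log path and prompt are assumptions, not the actual script)
#!/bin/bash
#SBATCH --job-name=ollama
#SBATCH --account=paceship-dsgt_clef2026
#SBATCH --gres=gpu:1
#SBATCH --constraint=RTX6000
#SBATCH --qos=embers
#SBATCH --time=1:00:00
#SBATCH --output=logs/ollama/%j.out

# start the server in the background, give it a moment to come up,
# then run a single prompt against it
module load ollama
ollama serve &
sleep 10
ollama run gemma3:4b "who is george p burdell?"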
  • Submit the batch script
sbatch ollama.sbatch
Submitted batch job 3016564
  • Check log output in logs/ollama
  • Verify the job is running
squeue -u ctio3
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
3016578 gpu-rtx60   ollama    ctio3  R       0:00      1 atl1-1-02-005-31-0
3016056 gpu-rtx60 sys/dash    ctio3  R      58:55      1 atl1-1-03-003-19-0

vLLM

  • Benchmark a model with vLLM
    • More specifically: run a throughput benchmark on Qwen3-0.6B, measuring how many tokens per second it generates with 32-token inputs, a single output token, and a maximum context length of 1024 tokens
uvx vllm bench throughput \
    --model Qwen/Qwen3-0.6B \
    --input-len 32 \
    --output-len 1 \
    --max-model-len 1024
  • Serve the model (starts a local OpenAI-compatible inference server)
MODEL="Qwen/Qwen3-0.6B"
uvx vllm serve $MODEL
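  • The server listens on port 8000 by default, which is what the request in the next step assumes; a different port can be chosen with --port if 8000 is busy (the value below is just an example)
uvx vllm serve $MODEL --port 8080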
  • In another terminal, send a completion request to the model and display the response
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"$MODEL\",\"prompt\":\"Hi\"}" | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   507  100   481  100    26    600     32 --:--:-- --:--:-- --:--:--   632
{
  "id": "cmpl-b214bbf7385900cc",
  "object": "text_completion",
  "created": 1766302389,
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "text": "Question\n\nThe given function is:\n\nf(x) = 3x^3",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 1,
    "total_tokens": 17,
    "completion_tokens": 16,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
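  • The server also exposes the OpenAI-compatible chat endpoint; a minimal sketch (the message content is just an example)
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"Say hi in one word\"}]}" | jq '.choices[0].message.content'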

Exercises

Next