Ollama is an open-source platform for running large language models such as gpt-oss, Gemma 3, DeepSeek-R1, and Qwen3 locally. It lets users generate text, assist with coding, and create content privately and securely on their own devices.
Prereqs
Add the following environment variables to ~/.bashrc so that dependencies are cached to scratch instead of your home directory:
# to make pip-installable tools available
export PATH=$HOME/.local/bin:$PATH

# to redirect uv and huggingface cache to scratch
export XDG_CACHE_HOME="$HOME/scratch/.cache"

# to redirect ollama models to scratch
export OLLAMA_MODELS=$HOME/scratch/ollama_models
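After editing ~/.bashrc, reload it in your current shell and confirm the variables are set. You can then check which ollama versions are available; the listing below is Lmod output (the exact query command, module spider here, is an assumption):

source ~/.bashrc
echo "$OLLAMA_MODELS"     # should point at scratch
module spider ollama      # list available ollama modules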
--------------------------------------------------------------------------------
  ollama:
--------------------------------------------------------------------------------
     Versions:
        ollama/0.5.1
        ollama/0.6.6
        ollama/0.9.0
--------------------------------------------------------------------------------
  For detailed information about a specific "ollama" package (including how to
  load the modules) use the module's full name. Note that names that have a
  trailing (E) are extensions provided by other modules.
Load the ollama module and start the server
# in the serving terminal window
module load ollama
ollama serve
In another terminal, run the client to download and run a Gemma 3 model:
# in the client terminal window
module load ollama
ollama run gemma3:4b
pulling manifest
pulling aeda25e63ebd: 100% ▕████████████████████▏ 3.3 GB
pulling e0a42594d802: 100% ▕████████████████████▏ 358 B
pulling dd084c7d92a3: 100% ▕████████████████████▏ 8.4 KB
pulling 3116c5225075: 100% ▕████████████████████▏  77 B
pulling b6ae5839783f: 100% ▕████████████████████▏ 489 B
verifying sha256 digest
writing manifest
success
>>> Send a message (/? for help)
You can now type messages, for example:
>>> write a haiku about georgia tech
Yellow jackets soar,
Tech's spirit, a brilliant hue,
Future leaders rise.
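Type /bye (or press Ctrl+D) to leave the interactive prompt; the server keeps running in the serving terminal.

>>> /bye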
Verify the server is running:
curl http://localhost:11434/api/tags | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   329  100   329    0     0   107k      0 --:--:-- --:--:-- --:--:--  107k
{
"models": [
{
"name": "gemma3:4b",
"model": "gemma3:4b",
"modified_at": "2025-12-21T00:08:40-05:00",
"size": 3338801804,
"digest": "a2af6cc3eb7fa8be8504abaf9b04e88f17a119ec3f04a3addf55f92841195f5a",
"details": {
"parent_model": "",
"format": "gguf",
"family": "gemma3",
"families": [
"gemma3"
],
"parameter_size": "4.3B",
"quantization_level": "Q4_K_M"
}
}
]
}
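The server also exposes a REST API on the same port, so you can generate text without the interactive client. A minimal example (the prompt is arbitrary; stream is disabled so the reply arrives as one JSON object):

curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "write a haiku about georgia tech",
  "stream": false
}' | jq -r '.response'

You can also run a quick benchmark with ollama-benchmark, the pip-installable tool that the PATH entry in the prerequisites makes available: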
ollama-benchmark --verbose --models gemma3:4b --prompts "who is george p burdell?"
Benchmarking Ollama
Verbose: True
Test models: ['gemma3:4b']
Prompts: ['who is george p burdell?']
Evaluating models: ['gemma3:4b']
Benchmarking: gemma3:4b
Prompt: who is george p burdell?

George P. Burdell (1924 - 2013) was a fascinating and remarkably long-lived American man who gained notoriety for his **extraordinary longevity and his claims of being immortal.** He became a minor celebrity and subject of numerous books, documentaries, and investigations. [...]

[the remaining generated tokens, a largely fabricated biography, and their verbatim repeat under "Response:" are omitted here for brevity]

----------------------------------------------------
 Model: gemma3:4b
   Performance Metrics:
     Prompt Processing: 1651.27 tokens/sec
     Generation Speed:  93.36 tokens/sec
     Combined Speed:    95.22 tokens/sec
   Workload Stats:
     Input Tokens:      16
     Generated Tokens:  755
     Model Load Time:   0.06s
     Processing Time:   0.01s
     Generation Time:   8.09s
     Total Time:        8.16s
----------------------------------------------------
Average stats:
----------------------------------------------------
 Model: gemma3:4b
   Performance Metrics:
     Prompt Processing: 1651.27 tokens/sec
     Generation Speed:  93.36 tokens/sec
     Combined Speed:    95.22 tokens/sec
   Workload Stats:
     Input Tokens:      16
     Generated Tokens:  755
     Model Load Time:   0.06s
     Processing Time:   0.01s
     Generation Time:   8.09s
     Total Time:        8.16s
----------------------------------------------------
Ollama script
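The Slurm batch script itself is not reproduced on this page. The sketch below is reconstructed from the set -x trace in the log further down, so details such as the #SBATCH directives, the helper-function bodies, and the logs/ollama output path are assumptions rather than the exact contents of ollama.sbatch (which additionally waits for the OpenAI-compatible completions endpoint and queries it through uv).

#!/bin/bash
#SBATCH --job-name=ollama
#SBATCH --partition=gpu-rtx6000      # partition/GPU/memory taken from the job log; adjust as needed
#SBATCH --gres=gpu:RTX_6000:1
#SBATCH --cpus-per-task=6
#SBATCH --mem=24G
#SBATCH --output=logs/ollama/%j.out  # assumed log location

set -x
hostname
nvidia-smi

module load ollama
MODEL=gemma3:270m

# pick an unused port so several jobs can share a node
find_free_port() {
    python3 -c 'import socket; s=socket.socket(); s.bind(("",0)); print(s.getsockname()[1]); s.close()'
}
PORT=$(find_free_port)
export OLLAMA_HOST=localhost:$PORT

# start the server in the background and make sure it is killed when the job ends
ollama serve &
SERVER_PID=$!
trap "echo 'Killing server pid $SERVER_PID'; kill $SERVER_PID" EXIT

# wait until the server answers on its port, then pull and run the model
curl --retry 5 --retry-connrefused --retry-delay 0 -sf "localhost:$PORT"
ollama pull "$MODEL"

echo "running inference via ollama frontend"
ollama run "$MODEL" "write a haiku about georgia tech"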
Run script
sbatch ollama.sbatch
Submitted batch job 3016564
Check the log output in logs/ollama:
---------------------------------------
Begin Slurm Prolog: Dec-21-2025 00:54:07
Job ID: 3016564
User ID: ctio3
Account: paceship-dsgt_clef2026
Job name: ollama
Partition: gpu-rtx6000
QOS: embers
---------------------------------------
+ main
+ hostname
atl1-1-02-004-33-0.pace.gatech.edu
+ nvidia-smi
Sun Dec 21 00:54:07 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
|   0  Quadro RTX 6000                On  |   00000000:3B:00.0 Off |                  Off |
| 33%   28C    P8              5W /  260W |       1MiB /  24576MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+
|  No running processes found                                                              |
+-----------------------------------------------------------------------------------------+
+ module load ollama
+ MODEL=gemma3:270m
++ find_free_port
+ PORT=45765
+ export OLLAMA_HOST=localhost:45765
+ SERVER_PID=1587307
+ trap 'echo '\''Killing server pid 1587307'\''; kill 1587307' EXIT
+ ollama serve
+ wait_for_ollama_host localhost:45765
+ curl --retry 5 --retry-connrefused --retry-delay 0 -sf localhost:45765
time=2025-12-21T00:54:09.142-05:00 level=INFO source=routes.go:1287 msg="Listening on 127.0.0.1:45765 (version 0.9.0)"
time=2025-12-21T00:54:09.536-05:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-1560f2db-2323-b99b-365e-50480c1c4571 library=cuda variant=v12 compute=7.5 driver=12.9 name="Quadro RTX 6000" total="23.5 GiB" available="23.3 GiB"
Ollama is running
+ ollama pull gemma3:270m
pulling manifest
pulling 735af2139dc6: 100% ▕████████████████████▏ 291 MB
...
pulling 74156d92caf6: 100% ▕████████████████████▏ 490 B
verifying sha256 digest
writing manifest
success
+ wait_for_completion_api 45765 gemma3:270m
Waiting for API...
+ curl -s -w '%{http_code}' -o /dev/null http://localhost:45765/v1/completions -H 'Content-Type: application/json' -d '{"model":"gemma3:270m","prompt":"Hi","max_tokens":1,"temperature":0}'
time=2025-12-21T00:54:19.643-05:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/storage/home/hcoda1/1/ctio3/scratch/ollama_models/blobs/sha256-735af2139dc652bf01112746474883d79a52fa1c19038265d363e3d42556f7a2 gpu=GPU-1560f2db-2323-b99b-365e-50480c1c4571 parallel=2 available=25015746560 required="1.3 GiB"
[... detailed model-load and llama-runner log lines trimmed ...]
time=2025-12-21T00:54:20.426-05:00 level=INFO source=server.go:630 msg="llama runner started in 0.50 seconds"
[GIN] 2025/12/21 - 00:54:20 | 200 | 1.311360013s | 127.0.0.1 | POST "/v1/completions"
API ready
running inference via ollama frontend
+ ollama run gemma3:270m 'write a haiku about georgia tech'
Golden sunbeams gleam,
Silicon dreams in the air,
A future bright and bold.
[GIN] 2025/12/21 - 00:54:20 | 200 | 166.228481ms | 127.0.0.1 | POST "/api/generate"
running inference via openai api frontend
+ python_openai_query 45765 gemma3:270m 'write a haiku about georgia tech'
+ uv run -
/var/lib/slurm/slurmd/job3016564/slurm_script: line 50: uv: command not found
+ echo 'Killing server pid 1587307'
Killing server pid 1587307
+ kill 1587307
---------------------------------------
Begin Slurm Epilog: Dec-21-2025 00:54:21
Job ID: 3016564
User ID: ctio3
Account: paceship-dsgt_clef2026
Job name: ollama
Resources: cpu=6,gres/gpu:rtx_6000=1,mem=24G,node=1
Rsrc Used: cput=00:01:42,vmem=0,walltime=00:00:17,mem=870188K,energy_used=0
Partition: gpu-rtx6000
QOS: embers
Nodes: atl1-1-02-004-33-0
---------------------------------------
Verify the job is running:
squeue -u ctio3
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3016578 gpu-rtx60   ollama    ctio3  R       0:00      1 atl1-1-02-005-31-0
           3016056 gpu-rtx60 sys/dash    ctio3  R      58:55      1 atl1-1-03-003-19-0
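To stop the job early, cancel it by job ID:

scancel 3016578   # job ID from the squeue output above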
vLLM
Benchmark a model with vLLM
More specifically: run a throughput benchmark on the Qwen3-0.6B model, measuring how many tokens it can generate per second with inputs of 32 tokens, only 1 output token, and a maximum context length of 1024 tokens, as sketched below.
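The exact commands are not reproduced here; the following is a sketch of one way to do it with vLLM's OpenAI-compatible server and its built-in benchmark CLI. The port, prompt, and number of prompts are assumptions, and the bench subcommand and flag names may differ between vLLM versions (check vllm bench throughput --help). The curl smoke test returns JSON of the shape shown below.

# load/install vllm however it is provided on the cluster (module, uv tool, pip, ...),
# then start an OpenAI-compatible server for the model
vllm serve Qwen/Qwen3-0.6B --max-model-len 1024 --port 8000 &

# once the server has finished loading, run a quick smoke test against the completions endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "Hello", "max_tokens": 16}' | jq

# offline throughput benchmark: 32 input tokens, 1 output token, 1024 max context
vllm bench throughput \
  --model Qwen/Qwen3-0.6B \
  --input-len 32 \
  --output-len 1 \
  --max-model-len 1024 \
  --num-prompts 1000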
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   507  100   481  100    26    600     32 --:--:-- --:--:-- --:--:--   632
{
  "id": "cmpl-b214bbf7385900cc",
  "object": "text_completion",
  "created": 1766302389,
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "text": "Question\n\nThe given function is:\n\nf(x) = 3x^3",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 1,
    "total_tokens": 17,
    "completion_tokens": 16,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
Exercises
What is the largest model in the Gemma 3 family that you can run with ollama on an RTX 6000 GPU?
The Nvidia Quadro RTX 6000 has 24 GiB (24576 MiB) of memory available. From the gemma3 model list at https://ollama.com/library/gemma3/tags, gemma3:27b-it-qat (18 GB) is the largest model you can run on this GPU.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 6000                On  |   00000000:AF:00.0 Off |                  Off |
| 33%   33C    P2             62W /  260W |    4976MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
| Processes:                                                                               |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                                     Usage |
|=========================================================================================|
|    0   N/A  N/A         1125743      C   ...kages/ollama/0.9.0/bin/ollama       4972MiB |
+-----------------------------------------------------------------------------------------+
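One way to check the answer empirically (a sketch; the tag comes from the Ollama library page above) is to run the largest candidate and then ask the server what is actually loaded:

ollama run gemma3:27b-it-qat "write a haiku about georgia tech"
ollama ps    # lists loaded models, their size, and GPU/CPU residency
nvidia-smi   # confirms the VRAM usage, as shown above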
Compare the performance metrics of a model family of your choice (e.g. gemma3 or phi4) across parameter sizes and quantization levels
When benchmarking gemma3 by generation speed (tokens/sec), speed decreases as parameter size increases. For a given parameter size, inference speed is similar across quantization levels.
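A simple way to collect these numbers is to loop ollama-benchmark over the tags of interest (the tag list below is an example; any tags from https://ollama.com/library/gemma3/tags, including quantization-specific ones, work the same way):

for tag in gemma3:270m gemma3:1b gemma3:4b gemma3:12b; do
    ollama-benchmark --verbose --models "$tag" --prompts "who is george p burdell?"
done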
Implement file-based configuration for the sbatch script so that the model and prompt can be switched without modifying the script directly. One way to do this is to use yq (uv tool install yq) to read from a YAML configuration file in bash; a sketch follows.
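A minimal sketch, assuming a config.yaml alongside the sbatch script and the pip-installable yq (which takes jq-style filters); the file name and keys are arbitrary:

# config.yaml
model: gemma3:270m
prompt: write a haiku about georgia tech

# in ollama.sbatch, replace the hard-coded values with:
MODEL=$(yq -r '.model' config.yaml)
PROMPT=$(yq -r '.prompt' config.yaml)
ollama pull "$MODEL"
ollama run "$MODEL" "$PROMPT"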