Module 5 · Lesson 1

What Is Ollama?

From a weekend project to powering millions of local AI deployments — the tool that made running LLMs genuinely easy.

Why did it take until 2023 for running local models to feel as simple as installing an app?

When llama.cpp landed in early 2023, thousands of developers suddenly had the ability to run Meta's LLaMA models on consumer hardware. The capability existed, but harnessing it required compiling C++ from source, manually downloading multi-gigabyte GGML (later GGUF) weight files from Hugging Face, crafting launch commands with a dozen flags, and repeating every step for each new model. The friction was enormous.

In mid-2023, Jeffrey Morgan and collaborators shipped Ollama as an open-source project. It wrapped the llama.cpp inference engine in a clean CLI and a local REST server, added automatic model downloads, and packaged the whole thing as a single installer. By early 2024, the project had accumulated over 40,000 GitHub stars and was being pulled into enterprise pilots at companies including Shopify and Slack for offline developer tooling.

The Core Idea

Ollama is a model runner, not a model itself. It sits between you and the underlying inference engine (llama.cpp, which in turn uses Metal, CUDA, or ROCm kernels for GPU acceleration), providing three things: a model registry, a runtime manager, and a local API.

Think of it like Docker, but for language models. Just as Docker packages apps with their dependencies so they run consistently anywhere, Ollama packages models with their configuration — quantization level, context length defaults, system prompt templates — into a single artifact called an Ollama Model.

The Docker Analogy

The Ollama team deliberately chose architecture that mirrors Docker's design. ollama pull mirrors docker pull. ollama run mirrors docker run. The local registry at ~/.ollama/models mirrors Docker's image store. This design choice made adoption dramatically faster among developers already comfortable with container workflows.

What Ollama Is Not

Ollama is not a chat interface — it has a minimal terminal REPL, but most users pair it with a separate front-end like Open WebUI or Enchanted. It is not a training or fine-tuning tool. And it is not a cloud service; every inference call runs entirely on your local machine against a server process bound to localhost:11434.

Key Terms

GGUF: The file format used by llama.cpp and Ollama for storing quantized model weights. Replaced the older GGML format in mid-2023. A single GGUF file contains both the weights and the model architecture metadata.

Modelfile: A plain-text configuration file (analogous to a Dockerfile) that tells Ollama which base weights to use, what system prompt to apply, and what inference parameters (temperature, context length) to set by default.

Ollama Registry: The public model library at ollama.com/library. As of mid-2025, it hosts over 100 models including Llama 3, Mistral, Gemma, Phi-3, Qwen, and DeepSeek variants.

llama.cpp: The C++ inference engine written by Georgi Gerganov that Ollama uses internally. It handles quantization, CPU/GPU scheduling, and the actual matrix math of transformer inference.

The Ollama API — Why It Matters

Beyond the CLI, Ollama exposes a REST API at http://localhost:11434. This means any application that can make an HTTP call — a Python script, a Node.js app, a shell script — can use locally running models without any API key or cloud dependency. The API also implements the OpenAI-compatible endpoint at /v1/chat/completions, meaning tools built for the OpenAI API work against Ollama with a single URL change.

This OpenAI compatibility layer was added in Ollama v0.1.24 (February 2024) and has been one of the most consequential features for enterprise adoption: in many codebases, switching from cloud to local inference means changing little more than the client's base URL and model name.
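As a rough illustration of the "single URL change" idea, the sketch below builds the same OpenAI-style chat request for either target using only Python's standard library. The helper name build_chat_request is hypothetical and the model name is just an example; nothing is sent over the network until the request is passed to urllib.request.urlopen.

```python
import json
import urllib.request

OPENAI_BASE = "https://api.openai.com/v1"   # cloud endpoint
OLLAMA_BASE = "http://localhost:11434/v1"   # local Ollama endpoint

def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "ollama") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Ollama ignores the key; OpenAI requires a real one.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# Only the base URL (and the model name) differs between the two targets.
local_req = build_chat_request(OLLAMA_BASE, "llama3.2", "Hello")
print(local_req.full_url)
```

Swapping OLLAMA_BASE for OPENAI_BASE (plus a real API key and a cloud model name) would produce the equivalent cloud request, which is the whole point of the compatibility layer.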

Real Deployment Note

GitHub Copilot alternatives like Continue.dev and Cody both support Ollama as a backend, allowing developers to run coding assistants entirely offline. As of 2024, Continue.dev listed Ollama as its recommended local backend for teams with data-residency requirements, specifically citing healthcare and legal use cases.

Lesson 1 Quiz

What Is Ollama?

1. Ollama's architecture is most analogous to which existing tool?

Correct. Ollama deliberately mirrors Docker's design — pull, run, and local registry model — making it immediately familiar to developers already using containers.

Not quite. Ollama's creators explicitly modeled it on Docker's architecture: a model registry, a runtime manager, and packaged artifacts with their configuration.

2. What port does Ollama's local API server bind to by default?

Correct. Ollama binds to localhost:11434. Any local application can make HTTP requests to this address without authentication.

The correct port is 11434. This is the default and can be changed via the OLLAMA_HOST environment variable, but 11434 is what you'll see in almost all documentation.

3. The OpenAI-compatible endpoint in Ollama allows developers to do what?

Correct. The /v1/chat/completions endpoint mirrors OpenAI's API signature, so existing integrations need only point their base URL at localhost:11434 instead of api.openai.com.

The compatibility layer doesn't proxy to OpenAI — it means your local Ollama server speaks OpenAI's API format, so apps written for OpenAI can talk to local models with a URL change.

4. Which of the following is NOT something Ollama does?

Correct. Ollama is purely an inference runner — it does not train or fine-tune models. For fine-tuning, you'd use separate tools like Unsloth, axolotl, or the transformers library.

Ollama does not fine-tune models. It downloads, stores, and runs pre-built models. Fine-tuning requires different tools and usually much more GPU memory than inference.

Lab 1 — Exploring Ollama's Architecture

Discuss Ollama's design decisions and how they compare to alternatives.

Your Task

You've just learned how Ollama wraps llama.cpp with a Docker-inspired interface. In this lab, discuss Ollama's design with your AI lab assistant. Explore why the Docker-style architecture matters, what the OpenAI compatibility endpoint changes for developers, and how Ollama differs from running llama.cpp directly.

Complete at least 3 exchanges to finish the lab.

Starter questions: Why did Ollama choose to mimic Docker's design? What problems does the OpenAI-compatible endpoint solve? How would you explain Ollama to a developer who has never run a local LLM?

Module 5 · Lesson 2

Installing Ollama

Platform-specific setup for macOS, Linux, and Windows — plus verifying your installation works correctly.

What does the installer actually do under the hood, and how do you confirm everything is running correctly?

When Mozilla's Innovation team ran an internal pilot of Ollama for offline developer tooling in early 2024, their setup guide noted that the most common failure mode wasn't installation — it was verification. Developers would install Ollama, assume it was running, then wonder why their IDE extension wasn't connecting. The fix was always the same: check that the Ollama service was actually started and that port 11434 was listening. This lesson teaches you to install correctly and verify confidently.

System Requirements

Ollama runs on macOS 11 Big Sur or later (Apple Silicon recommended but Intel supported), Linux (x86_64 and ARM64), and Windows 10/11 (natively, or inside WSL2). GPU acceleration requires NVIDIA (CUDA 11.3+) or AMD (ROCm 5.7+) on Linux, or the Apple GPU on macOS via Metal.

Minimum RAM: 8 GB to run 7B models, 16 GB for 13B models, and 32 GB or more for 33B–34B models. CPU-only mode works but is significantly slower: expect 2–5 tokens/second instead of 30–80 tokens/second on a modern GPU.

Installation by Platform

macOS

# Download the macOS installer from ollama.com/download
# Or install via Homebrew:
brew install ollama

# Start the Ollama server in the foreground
ollama serve
# Or simply open Ollama.app from Applications (runs in the menu bar)

On macOS, Ollama installs as a menu bar application. When you launch it, the Ollama service starts automatically and places an icon in your menu bar. The CLI is also installed at /usr/local/bin/ollama. Apple Silicon Macs (M1/M2/M3/M4) use Metal for GPU acceleration automatically, with no extra configuration needed.

Linux

# One-line installer (official, from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# The installer creates a systemd service.
# Check it's running:
systemctl status ollama

# If not running, start it and enable it on boot:
sudo systemctl start ollama
sudo systemctl enable ollama

On Linux, the install script detects your NVIDIA or AMD GPU automatically and installs the appropriate CUDA or ROCm libraries. The Ollama service is registered as a systemd unit that starts on boot. For servers without GPUs, the script installs CPU-only mode.

Windows

# Windows: Download OllamaSetup.exe from ollama.com/download
# Run the installer; it installs to %LOCALAPPDATA%\Programs\Ollama
# Ollama starts automatically and appears in the system tray

# PowerShell verification:
ollama --version
# Expected output: ollama version 0.x.x

On Windows, Ollama runs natively (no WSL2 required as of v0.1.22). The installer handles path setup. NVIDIA GPU support works via CUDA drivers. AMD GPU support on Windows arrived in Ollama v0.1.29. If you have WSL2 installed, you can also run Ollama inside WSL2 and access it from Windows applications.

Verifying Your Installation

Three verification steps, in order:

Version check: Run ollama --version in your terminal. If you see a version number, the CLI is installed correctly.
Service check: Run ollama list — this asks the Ollama service for your installed models. If you get an error like "connection refused," the service isn't running. Start it with ollama serve in a separate terminal.
API check: Run curl http://localhost:11434 or open that URL in a browser. You should see the text "Ollama is running." If you see this, the API server is live and accepting requests.
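The API check can also be scripted, which is handy in setup scripts or IDE tooling. The following is a minimal sketch using only the standard library; the function name ollama_is_running is hypothetical, and it assumes the server root returns the "Ollama is running" text described above.

```python
import urllib.error
import urllib.request

def ollama_is_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if the server root answers with the 'Ollama is running' text."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar errors.
        return False
    return "Ollama is running" in body
```

Calling ollama_is_running() with no arguments checks the default address; pass a different base_url if you changed OLLAMA_HOST.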

Common Installation Issue

On Linux, if ollama serve returns "address already in use," another Ollama instance (often the systemd service) is already running. Use sudo systemctl stop ollama before running manually, or simply use the service. Don't run both simultaneously.
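Before starting ollama serve by hand, you can check whether something is already listening on the port. This is a small sketch with a hypothetical helper name, port_in_use; note that it only tells you that some process accepts connections on the port, not that the process is Ollama.

```python
import socket

def port_in_use(host: str = "127.0.0.1", port: int = 11434,
                timeout: float = 1.0) -> bool:
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

If port_in_use() returns True on Linux, the systemd service is the most likely owner of the port.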

GPU Detection

After installing, run ollama run llama3.2:1b (a small model good for testing). While the model is loaded, run ollama ps in a second terminal: the PROCESSOR column reports where the model is running, for example "100% GPU" or "100% CPU". The server log also records which GPU, if any, was detected at startup. If you expect GPU acceleration but see CPU, check that your GPU drivers are up to date and that Ollama's CUDA/ROCm libraries are correctly installed.

Environment Variable Reference

OLLAMA_HOST: Changes the bind address (default: 127.0.0.1:11434). Set it to 0.0.0.0:11434 to allow access from other machines on the network. The API has no built-in authentication, so do this only on a trusted network.
OLLAMA_MODELS: Changes where model files are stored (default: ~/.ollama/models).
OLLAMA_NUM_PARALLEL: Number of requests each loaded model handles in parallel (the default has varied between Ollama versions).
OLLAMA_MAX_LOADED_MODELS: How many models may stay loaded in memory at the same time.
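Client scripts often want to honor the same OLLAMA_HOST setting as the server. The sketch below derives a base URL from that variable; the function name client_base_url and its parsing rules are simplified assumptions for illustration (no IPv6 handling, for example), not Ollama's own parser.

```python
import os

DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 11434

def client_base_url(env=None) -> str:
    """Derive a client-side base URL from OLLAMA_HOST (simplified parsing)."""
    env = os.environ if env is None else env
    raw = env.get("OLLAMA_HOST", "").strip()
    if not raw:
        return f"http://{DEFAULT_HOST}:{DEFAULT_PORT}"
    scheme = "http"
    if "://" in raw:
        scheme, raw = raw.split("://", 1)
    raw = raw.rstrip("/")
    host, sep, port = raw.rpartition(":")
    if not sep:  # no port given, e.g. "192.168.1.20"
        host, port = raw, str(DEFAULT_PORT)
    if host in ("", "0.0.0.0"):
        # 0.0.0.0 is a bind address, not something a client should dial.
        host = DEFAULT_HOST
    return f"{scheme}://{host}:{port}"

print(client_base_url({"OLLAMA_HOST": "0.0.0.0:11434"}))
```

Passing a plain dict instead of os.environ, as in the last line, makes the function easy to try out without changing your shell environment.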

Lesson 2 Quiz

Installing Ollama

1. What does the command curl http://localhost:11434 verify?

Correct. A successful response ("Ollama is running") from localhost:11434 confirms the API server is live. The CLI and model downloads are separate concerns.

The curl command checks only that the HTTP server is listening at that port. It says nothing about drivers, models, or the CLI — just that the API endpoint is reachable.

2. On macOS with Apple Silicon, what technology does Ollama use for GPU acceleration?

Correct. Apple Silicon Macs use Metal, Apple's GPU API. Ollama (via llama.cpp) uses Metal automatically on M1/M2/M3/M4 chips — no configuration needed.

CUDA is NVIDIA-specific. ROCm is AMD-specific. On Apple Silicon, Ollama uses Metal — Apple's own GPU compute API — which is configured automatically.

3. If ollama serve returns "address already in use" on Linux, what is the most likely cause?

Correct. On Linux, the installer creates a systemd service that starts automatically. Running ollama serve manually then tries to bind to the same port — fix it by stopping the systemd service first.

The error means port 11434 is already taken — most likely by the systemd Ollama service that started on boot. Stop it with sudo systemctl stop ollama before running manually.

4. Which environment variable would you set to allow other machines on your network to reach your Ollama server?

Correct. Setting OLLAMA_HOST=0.0.0.0:11434 makes the server listen on all network interfaces, allowing other devices to connect. By default it only binds to 127.0.0.1 (localhost).

OLLAMA_HOST controls the network bind address. Setting it to 0.0.0.0:11434 exposes Ollama on all interfaces. The other variables control model storage, concurrency, and memory.

Lab 2 — Installation Troubleshooting

Work through real installation scenarios and verification steps.

Your Task

Your AI assistant will present you with Ollama installation scenarios — things that commonly go wrong on real machines. Work through the diagnosis and fix for each scenario. Discuss verification steps, environment variables, and platform-specific considerations.

Complete at least 3 exchanges to finish the lab.

Scenario to start: "I installed Ollama on Linux, ran ollama run mistral, and got 'Error: could not connect to ollama app. Is it running?' What are the steps to diagnose and fix this?"

Module 5 · Lesson 3

Pulling and Running Models

The Ollama model library, quantization tags, and how to choose the right model size for your hardware.

With dozens of models available, how do you pick the right one — and what do those cryptic tags like "q4_K_M" actually mean?

When Meta released Llama 3 in April 2024, Ollama had it available as ollama pull llama3 within hours. Within 48 hours, it had been downloaded over 500,000 times from the Ollama registry — more than any previous model launch on the platform. The speed of availability became a defining advantage: model releases that previously required navigating Hugging Face's interface, finding the right GGUF conversion, and manually configuring llama.cpp could now be accessed with a single command.

The Ollama Model Library

Visit ollama.com/library to browse available models. Each model page shows available tags (sizes and quantizations), parameter counts, and VRAM requirements. The search is filterable by capability: code, vision, embedding, and so on.

The most commonly used models as of mid-2025:

llama3.2 mistral gemma3 phi4 deepseek-r1 qwen3 llava (vision)

Pull, Run, and List

# Pull a model (downloads to ~/.ollama/models)
ollama pull llama3.2

# Run a model (pulls automatically if not present)
ollama run llama3.2

# List all locally installed models
ollama list

# Remove a model to free disk space
ollama rm llama3.2

# Show model info (parameters, template, license)
ollama show llama3.2

Understanding Quantization Tags

Every model in the Ollama library has a default tag and optional variants. The default (e.g., llama3.2) points to the recommended quantization for most hardware. Specific tags let you choose different tradeoffs.

:3b / :7b / :13b: Parameter count. More parameters generally means higher quality but requires more RAM and runs more slowly. A 7B model needs ~4–8 GB RAM depending on quantization.

q4_K_M: 4-bit quantization, K-quant method, Medium variant. The most common default. Balances quality and memory use well. "K-quants" are smarter than basic 4-bit quantization: they apply higher precision to more important layers.

q8_0: 8-bit quantization. Higher quality, roughly 2× the RAM of q4_K_M. Use when you have enough VRAM and want quality closer to full precision.

q2_K: 2-bit quantization. Lowest memory use, but significant quality degradation. Useful only when RAM is extremely limited.

fp16: Half-precision floating point, essentially unquantized. Best quality but roughly 3 to 4× the RAM of q4_K_M. Only practical on high-VRAM GPUs (24 GB+).
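The memory figures above follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below makes that explicit. The bits-per-weight values are rough, assumed figures for illustration (real GGUF files vary by model), and the function name weight_size_gb is hypothetical.

```python
# Approximate bits stored per weight for each tag. These are rough,
# assumed figures for illustration; real GGUF files vary by model.
BITS_PER_WEIGHT = {
    "q2_K": 3.0,
    "q4_K_M": 4.8,
    "q8_0": 8.5,
    "fp16": 16.0,
}

def weight_size_gb(params_billions: float, quant: str) -> float:
    """Estimate the size of the weights alone, in GB (10^9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    # Billions of parameters times bytes per parameter gives gigabytes.
    return params_billions * bits / 8

for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{weight_size_gb(7, quant):.1f} GB")
```

This covers the weights only. Actual RAM use is higher because the context (KV cache) and runtime buffers also need memory, which is why larger num_ctx values increase the footprint.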

# Pull a specific quantization variant
ollama pull llama3.2:3b-instruct-q8_0
ollama pull mistral:7b-instruct-q4_K_M
ollama pull gemma3:27b-it-q4_K_M
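To make the tag structure concrete, here is a small parser that splits a model reference into its parts. It is a heuristic sketch: the regular expressions and the function name parse_model_ref are assumptions based on the tag examples in this lesson, since tag naming is a convention rather than a formal grammar.

```python
import re

# Heuristic: a quantization suffix looks like q4_K_M, q8_0, q2_K or fp16.
_QUANT_RE = re.compile(r"(q\d+_[0-9A-Za-z_]+|fp16)$", re.IGNORECASE)
# Heuristic: a size looks like 3b, 7b, 27b or 1.5b.
_SIZE_RE = re.compile(r"^\d+(\.\d+)?b$", re.IGNORECASE)

def parse_model_ref(ref: str) -> dict:
    """Split 'name:tag' into name, tag, and best-guess size and quantization."""
    name, _, tag = ref.partition(":")
    tag = tag or "latest"  # Ollama uses 'latest' when no tag is given
    quant_match = _QUANT_RE.search(tag)
    size = next((p for p in tag.split("-") if _SIZE_RE.match(p)), None)
    return {
        "name": name,
        "tag": tag,
        "size": size,
        "quant": quant_match.group(1) if quant_match else None,
    }

print(parse_model_ref("mistral:7b-instruct-q4_K_M"))
```

A bare name such as llama3.2 parses to the latest tag with no size or quantization, which mirrors how the default tag hides those details from you.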

Choosing the Right Model for Your Hardware

Quick Reference: RAM to Model Size

8 GB RAM: 7B models at q4_K_M (fits, but with limited headroom) or 3B models with room to spare.
16 GB RAM: 13B models at q4_K_M, or 7B models at q8_0.
32 GB RAM: 34B models at q4_K_M.
64 GB+ RAM: 70B models become feasible.
Apple Silicon users benefit from unified memory: a MacBook Pro M3 Max with 128 GB can run 70B models entirely in fast RAM.
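The quick reference can be encoded as a simple lookup, which is roughly what a setup script might do to suggest a starting model. The tiers below are copied from the reference above; the function name suggest_model and the fallback for machines under 8 GB are assumptions.

```python
# (minimum RAM in GB, suggestion) pairs taken from the quick reference,
# ordered from largest to smallest.
RAM_TIERS = [
    (64, "70B at q4_K_M"),
    (32, "34B at q4_K_M"),
    (16, "13B at q4_K_M (or 7B at q8_0)"),
    (8, "7B at q4_K_M (or 3B for more headroom)"),
]

def suggest_model(ram_gb: float) -> str:
    """Return the largest tier from the quick reference that fits ram_gb."""
    for min_ram, suggestion in RAM_TIERS:
        if ram_gb >= min_ram:
            return suggestion
    # Assumption: below the table's range, fall back to a very small model.
    return "a 1B-3B model at q4_K_M"

print(suggest_model(16))
```

Treat the result as a starting point: other applications, context length, and whether the memory is GPU VRAM or system RAM all shift what actually fits.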

Running a Model — What to Expect

The first run of any model downloads the GGUF file (2–40 GB depending on size and quantization). After that, it's cached locally. When you run ollama run model, you enter an interactive REPL where you can type prompts. Type /bye to exit, or /help for in-REPL commands.

# Single-shot inference from the command line (no REPL)
ollama run llama3.2 "Explain transformer attention in one paragraph."

# Pipe input from a file
cat document.txt | ollama run mistral "Summarize this:"

# Useful REPL commands
/set parameter temperature 0.2   — lower = more deterministic
/set parameter num_ctx 8192      — expand context window
/show info                        — display loaded model details
/bye                              — exit the REPL

Real-World Tip from the Ollama Community

When benchmarking models, the Ollama team's own documentation suggests using llama3.2:3b for quick tests (fast download, runs on any machine) and mistral:7b-instruct-q4_K_M as a quality baseline for instruction-following tasks. For coding, qwen2.5-coder:7b consistently outperforms general models of the same size on code completion benchmarks as of early 2025.

Lesson 3 Quiz

Pulling and Running Models

1. What does the "q4_K_M" tag on an Ollama model mean?

Correct. q4_K_M = 4-bit quantization, K-quant method (which applies variable precision across layers), Medium variant. It's the most common default because it balances quality and memory efficiently.

The q prefix means quantization bits (4-bit here), K refers to the K-quant quantization method, and M is the Medium size variant of that method. Nothing to do with architecture names or parameter counts.

2. A developer has 16 GB of RAM. Which of the following would fit most comfortably?

Correct. A 13B model at q4_K_M requires roughly 8–10 GB of RAM, fitting comfortably in 16 GB with room for the OS and other applications.

70B q4_K_M needs ~40 GB. 34B q8_0 needs ~35 GB. 7B fp16 needs ~14 GB (tight, often unstable). The 13B q4_K_M at ~9 GB is the right fit for 16 GB systems.

3. What command would you use to remove a model and free disk space?

Correct. ollama rm [model-name] removes the model from your local store. This is the only command that deletes model files.

The correct command is ollama rm. Ollama doesn't use delete, uninstall, or prune — rm mirrors the Unix convention for removal.

4. Which quantization level offers the best quality but requires the most RAM?

Correct. fp16 (half-precision float) is essentially unquantized — highest quality but roughly 4× the RAM of q4_K_M. Only practical on machines with 24 GB+ VRAM.

The quality/memory tradeoff goes: q2_K (worst quality, least RAM) → q4_K_M → q8_0 → fp16 (best quality, most RAM). fp16 is the most demanding and highest quality.

Lab 3 — Model Selection Strategy

Practice choosing models and quantizations for real hardware constraints.

Your Task

Your AI assistant will present hardware scenarios with specific RAM, GPU, and use-case requirements. For each, work out which model and quantization tag you'd recommend and why. Discuss the tradeoffs between quality, speed, and memory for different Ollama model choices.

Complete at least 3 exchanges to finish the lab.

First scenario: "A legal team wants to run document summarization locally on laptops with 16 GB RAM and integrated Intel graphics. They need reasonable quality, privacy is essential, and speed is secondary. What Ollama model and tag would you recommend, and why?"

Module 5 · Lesson 4

The Ollama API and Modelfiles

Calling Ollama from code, building custom model variants with Modelfiles, and integrating with popular tools.

How do you go from running a model in the terminal to actually building something useful with it?

In mid-2024, the Continue.dev team published a case study describing how a healthcare software company used Ollama's REST API to build an internal code review assistant. The system pulled a custom Modelfile-configured version of CodeLlama with a system prompt tailored to their coding standards, then called it from a VS Code extension via the OpenAI-compatible endpoint. The entire pipeline — from engineer's code editor to inference result — never left the company's network. The key technical detail: using Ollama's Modelfile to bake in the system prompt meant the application code stayed clean and the model's persona could be updated without touching the application.

Calling the REST API

Ollama exposes two main API styles: its native API (simpler, Ollama-specific) and the OpenAI-compatible API (broader compatibility). Both are served from localhost:11434.

# Native Ollama API — generate endpoint
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "What is the capital of France?",
    "stream": false
  }'

# Native API — chat endpoint (maintains conversation format)
curl http://localhost:11434/api/chat \
  -d '{
    "model": "mistral",
    "messages": [
      {"role": "user", "content": "Explain Docker in one sentence."}
    ],
    "stream": false
  }'

# OpenAI-compatible endpoint (drop-in replacement)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Calling from Python

# Using the official ollama Python library
# Install it first from your shell: pip install ollama

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}]
)
print(response['message']['content'])

# Or using the openai library with Ollama as the backend
# Install it first from your shell: pip install openai

from openai import OpenAI

# The api_key is required by the client but ignored by Ollama
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
response = client.chat.completions.create(
    model='mistral',
    messages=[{'role': 'user', 'content': 'Write a haiku about inference.'}]
)
print(response.choices[0].message.content)
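The two API styles return differently shaped JSON, which matters if you call the endpoints directly instead of through a client library. The sketch below shows where the reply text lives in each shape. The helper name extract_reply is hypothetical, and the sample bodies are trimmed (real responses carry additional fields such as timings).

```python
import json

def extract_reply(response: dict) -> str:
    """Return the assistant text from any of the three response shapes."""
    if "choices" in response:    # OpenAI-compatible /v1/chat/completions
        return response["choices"][0]["message"]["content"]
    if "message" in response:    # native /api/chat
        return response["message"]["content"]
    if "response" in response:   # native /api/generate
        return response["response"]
    raise ValueError("unrecognized response shape")

# Trimmed sample bodies standing in for real HTTP responses.
native_chat = json.loads(
    '{"model": "mistral", "message": {"role": "assistant", "content": "Hi"}, "done": true}'
)
openai_style = json.loads(
    '{"choices": [{"index": 0, "message": {"role": "assistant", "content": "Hi"}}]}'
)

print(extract_reply(native_chat), extract_reply(openai_style))
```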

Modelfiles — Building Custom Models

A Modelfile is a plain-text configuration that creates a new named model in your local Ollama registry. Think of it as a lightweight wrapper around an existing model that bakes in a system prompt, default parameters, and optionally a custom stop token format.

# Example Modelfile: a focused coding assistant
FROM mistral

# Set inference parameters
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
PARAMETER top_p 0.9

# System prompt baked into the model
SYSTEM """You are a precise Python coding assistant. You write clean,
well-commented Python 3 code. You always include type hints. You explain
your choices briefly. Never write code without explaining it."""

# Build the custom model
ollama create python-assistant -f ./Modelfile

# Run it exactly like any other model
ollama run python-assistant

Key Modelfile Instructions

FROM: Specifies the base model. Can be an Ollama library model name (e.g., FROM llama3.2) or a path to a local GGUF file (FROM ./model.gguf). Required: every Modelfile needs this.

SYSTEM: Sets the system prompt. This text is prepended to every conversation automatically. Use it to define the assistant's persona, constraints, or domain focus.

PARAMETER: Sets inference parameters. Common ones: temperature (typically 0 to 1, though higher values are accepted; higher = more creative), num_ctx (context window tokens), top_p (nucleus sampling threshold), repeat_penalty (reduces repetition).

TEMPLATE: Overrides the default chat template. Most models ship with their own template (Llama 2 and Mistral use [INST] markers, Llama 3 uses <|start_header_id|> header tokens), so only override this if you know exactly what you're doing.
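When you maintain several model variants, it can help to generate Modelfiles from code rather than by hand. The sketch below renders the FROM, PARAMETER, and SYSTEM instructions described above into Modelfile text. The function name render_modelfile is hypothetical, and it assumes the system prompt does not itself contain triple quotes.

```python
def render_modelfile(base: str, system: str = "", parameters=None) -> str:
    """Return Modelfile text with FROM, optional PARAMETER lines, and SYSTEM."""
    lines = [f"FROM {base}"]
    for key, value in (parameters or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    if system:
        # Triple quotes let the system prompt span multiple lines.
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"

text = render_modelfile(
    "mistral",
    system="You are a precise Python coding assistant.",
    parameters={"temperature": 0.1, "num_ctx": 8192},
)
print(text)
```

Writing text to a file named Modelfile and running ollama create python-assistant -f ./Modelfile would then register the variant, as in the example earlier in this lesson.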

Streaming Responses

By default, the native API endpoints (/api/generate and /api/chat) stream tokens as they're generated, sending one JSON object per line. This is great for user-facing applications but requires handling streaming JSON. Set "stream": false to get a single complete response (easier for scripts and testing). For production applications, streaming almost always produces a better user experience.

# Python streaming example
import ollama

for chunk in ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Count to 10 slowly.'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
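If you call the native /api/chat endpoint without the ollama library, you handle the stream yourself: each line of the response body is a JSON object, and the last one has done set to true. The sketch below accumulates the text from such lines. The function name collect_stream is hypothetical, and the sample chunks are trimmed stand-ins for a real response body.

```python
import json
from typing import Iterable

def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the content of newline-delimited JSON chunks from /api/chat."""
    parts = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):  # the final chunk carries done: true
            break
    return "".join(parts)

# Sample chunks standing in for an HTTP response body.
sample = [
    b'{"message": {"role": "assistant", "content": "1, "}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "2"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print(collect_stream(sample))
```

Because the function accepts any iterable of byte lines, a response object from urllib.request.urlopen can be passed in directly; a user-facing application would print each piece as it arrives instead of joining at the end.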

Integration Ecosystem

As of mid-2025, Ollama is natively supported by: LangChain (OllamaLLM and OllamaEmbeddings classes), LlamaIndex (Ollama LLM and embedding integrations), Continue.dev (VS Code/JetBrains IDE extension), Open WebUI (browser-based chat front-end), Enchanted (macOS native app), AnythingLLM (document chat system), and Jan (cross-platform desktop app). In each case, you point the tool at localhost:11434 and it works with any model you've pulled.

Lesson 4 Quiz

The Ollama API and Modelfiles

1. What is the minimum required instruction in every Ollama Modelfile?

Correct. FROM specifies the base model and is the only required instruction. Every other Modelfile instruction — SYSTEM, PARAMETER, TEMPLATE — is optional.

FROM is the only required Modelfile instruction. It tells Ollama what base model (or GGUF file) to build from. SYSTEM, PARAMETER, and TEMPLATE are all optional customizations.

2. An app is built to call the OpenAI API. What change is needed to point it at a local Ollama server instead?

Correct. Ollama's /v1/chat/completions endpoint is OpenAI API-compatible. Changing the base_url to http://localhost:11434/v1 is all that's needed. The API key can be set to any non-empty string like "ollama".

The OpenAI-compatible endpoint at /v1 means you only need to change the base URL. No middleware, no SDK swap, no rewrite — just redirect the client to localhost:11434/v1.

3. In an Ollama Modelfile, the SYSTEM instruction does what?

Correct. SYSTEM sets the system prompt that will be automatically prepended to every conversation with this model variant. It defines the assistant's persona, constraints, or domain focus.

SYSTEM in a Modelfile is the system prompt — the text that shapes the model's behavior for every conversation. It's nothing to do with OS requirements or hardware libraries.

4. Which PARAMETER setting would make a model's outputs more deterministic and consistent?

Correct. Lower temperature (close to 0) makes the model more deterministic — it consistently picks the highest-probability next token. Temperature 0.1 is typical for coding assistants and factual applications.

Temperature controls randomness. Lower = more deterministic. Temperature 0.1 makes outputs very consistent. Temperature 1.5 would make outputs more random. top_p and num_ctx don't primarily control determinism.

Lab 4 — Designing Modelfiles and API Integrations

Practice designing Modelfiles and planning API integrations for real use cases.

Your Task

Work with your AI assistant to design Modelfiles for specific use cases and plan how to integrate Ollama's API into real applications. Discuss system prompt design, parameter choices, and which integration approach (native API vs OpenAI-compatible) best fits different scenarios.

Complete at least 3 exchanges to finish the lab.

Start here: "Design a Modelfile for a customer support assistant for a software company. The assistant should be helpful but concise, never make up product information, and always ask clarifying questions when a request is ambiguous. What base model, PARAMETER settings, and SYSTEM prompt would you use?"

Module 5 Test

Setting Up Ollama — 15 questions · 80% to pass

1. Ollama's creators explicitly modeled its design on which tool?

Correct. Ollama mirrors Docker's pull/run model and local registry architecture.

Ollama was designed to mirror Docker — pull, run, and local image store — not Kubernetes, pip, or Homebrew.

2. What does GGUF stand for and what does it contain?

Correct. GGUF files contain both quantized weights and architecture metadata — everything needed to load and run the model.

GGUF contains both weights and architecture metadata. It replaced GGML in mid-2023 and is the standard format for llama.cpp and Ollama models.

3. Which command checks whether the Ollama service is reachable?

Correct. Curling localhost:11434 returns "Ollama is running" when the service is live. Ollama has no status, ping, or health subcommand.

Ollama doesn't have a status, ping, or health command. Checking the HTTP endpoint with curl is the standard verification method.

4. On macOS Apple Silicon, Ollama uses which GPU acceleration technology?

Correct. Metal is Apple's GPU compute API used automatically by Ollama on M-series chips.

Metal is Apple's answer to CUDA for GPU compute. Ollama uses it automatically on Apple Silicon — CUDA is NVIDIA, ROCm is AMD, Vulkan is cross-platform but not used here.

5. The OpenAI-compatible endpoint in Ollama is at which path?

Correct. /v1/chat/completions mirrors the OpenAI API path exactly, allowing drop-in compatibility.

The OpenAI-compatible path is /v1/chat/completions — the same path used by the OpenAI API, allowing apps to work with both by just changing the base URL.

6. What does ollama list do?

Correct. ollama list shows your local model store — the models downloaded to ~/.ollama/models.

ollama list shows locally installed models. For the online library, visit ollama.com/library. To see which models are currently loaded in memory, use ollama ps (available in newer versions).

7. Which environment variable changes where Ollama stores model files?

Correct. OLLAMA_MODELS overrides the default ~/.ollama/models path. Useful when models should live on an external drive or network share.

OLLAMA_MODELS sets the model storage directory. OLLAMA_HOST controls the network bind address. The others aren't standard Ollama environment variables.

8. A developer has 8 GB RAM. Which model would run most reliably?

Correct. A 3B model at q4_K_M needs roughly 2–3 GB — easily fits in 8 GB with plenty of room for the OS. Larger models would exhaust RAM and cause crashes or severe slowdowns.

With 8 GB RAM, a 3B model at q4_K_M (~2–3 GB) is the safe choice. 7B q8_0 (~8 GB) would use nearly all RAM. 13B and 27B are far too large.

9. What is the purpose of the PARAMETER instruction in a Modelfile?

Correct. PARAMETER sets inference-time settings like temperature (randomness), num_ctx (context window), top_p (sampling), and repeat_penalty.

PARAMETER in a Modelfile sets inference settings — temperature, context window, sampling parameters. It has nothing to do with Python packages, model size declarations, or environment variables.

10. Which quantization offers the best balance of quality and memory use?

Correct. q4_K_M is the standard recommended quantization for most use cases — it uses the K-quant method to preserve quality better than basic 4-bit quantization while keeping RAM use low.

q4_K_M is the community standard for balanced use. q2_K sacrifices too much quality. fp16 and q8_0 are higher quality but require significantly more RAM. That balance is why Ollama defaults to q4_K_M for most models.

11. On Linux, what installs the Ollama service as a systemd unit?

Correct. The curl-based install script from ollama.com handles the systemd unit creation automatically — including the service file and enabling it on boot.

The one-liner curl -fsSL https://ollama.com/install.sh | sh creates and enables the systemd service automatically. Ollama has no install --service flag, and it's not in standard apt/yum repos.

12. Which tool listed below natively supports Ollama as a local backend for IDE-based AI assistance?

Correct. Continue.dev is an open-source IDE extension for VS Code and JetBrains that natively supports Ollama as its local inference backend.

Continue.dev is the correct answer. It's an open-source coding assistant extension that lists Ollama as its recommended local backend. GitHub Copilot, Cursor, and CodeWhisperer are cloud-only services.

13. What command creates a new named model from a Modelfile?

Correct. ollama create [name] -f [Modelfile] is the command. After running it, the model appears in ollama list and can be run like any other.

The correct command is ollama create. Ollama doesn't use build, new, or package for this operation. This is one place where the Docker analogy breaks down: Docker builds images with docker build, whereas Ollama's verb is create.

14. Setting stream: false in an Ollama API request means:

Correct. stream: false tells Ollama to collect the full response before returning it as one JSON object. Simpler for scripts, but the client waits for the complete response before receiving anything.

stream: false affects response delivery only — the complete response arrives as one chunk rather than token-by-token. It has no effect on GPU use, token limits, or queue priority.

15. Apple Silicon Macs have a significant advantage for running local LLMs because:

Correct. Unified memory means the GPU can access all system RAM at high bandwidth. A MacBook Pro M3 Max with 128 GB can run 70B models entirely in fast accessible memory — something a PC GPU with 24 GB VRAM can't do.

Apple Silicon uses unified memory — the CPU and GPU share the same memory pool at high bandwidth. This is why M-series Macs can run much larger models than a PC with a dedicated GPU that has separate, limited VRAM.