When llama.cpp landed in early 2023, thousands of developers suddenly had the ability to run Meta's LLaMA models on consumer hardware. The capability existed — but harnessing it required compiling C++ from source, manually downloading multi-gigabyte GGUF weight files from Hugging Face, crafting launch commands with a dozen flags, and repeating every step for each new model. The friction was enormous.
In late 2023, Jeffrey Morgan and collaborators shipped Ollama as an open-source project. It wrapped the llama.cpp inference engine in a clean CLI and a local REST server, added automatic model downloads, and packaged the whole thing as a single installer. By early 2024, the project had accumulated over 40,000 GitHub stars and was being pulled into enterprise pilots at companies including Shopify and Slack for offline developer tooling.
Ollama is a model runner — not a model itself. It sits between you and the underlying inference engine (llama.cpp, or for some backends, Metal/CUDA kernels), providing three things: a model registry, a runtime manager, and a local API.
Think of it like Docker, but for language models. Just as Docker packages apps with their dependencies so they run consistently anywhere, Ollama packages models with their configuration — quantization level, context length defaults, system prompt templates — into a single artifact called an Ollama Model.
The Ollama team deliberately chose architecture that mirrors Docker's design. ollama pull mirrors docker pull. ollama run mirrors docker run. The local registry at ~/.ollama/models mirrors Docker's image store. This design choice made adoption dramatically faster among developers already comfortable with container workflows.
Ollama is not a chat interface — it has a minimal terminal REPL, but most users pair it with a separate front-end like Open WebUI or Enchanted. It is not a training or fine-tuning tool. And it is not a cloud service; every inference call runs entirely on your local machine against a server process bound to localhost:11434.
Beyond the CLI, Ollama exposes a REST API at http://localhost:11434. This means any application that can make an HTTP call — a Python script, a Node.js app, a shell script — can use locally running models without any API key or cloud dependency. The API also implements the OpenAI-compatible endpoint at /v1/chat/completions, meaning tools built for the OpenAI API work against Ollama with a single URL change.
This OpenAI compatibility layer was added in Ollama v0.1.24 (November 2023) and has been one of the most consequential features for enterprise adoption — it means switching from cloud to local inference requires changing one line in most codebases.
GitHub Copilot alternatives like Continue.dev and Cody both support Ollama as a backend, allowing developers to run coding assistants entirely offline. As of 2024, Continue.dev listed Ollama as its recommended local backend for teams with data-residency requirements, specifically citing healthcare and legal use cases.
You've just learned how Ollama wraps llama.cpp with a Docker-inspired interface. In this lab, discuss Ollama's design with your AI lab assistant. Explore why the Docker-style architecture matters, what the OpenAI compatibility endpoint changes for developers, and how Ollama differs from running llama.cpp directly.
Complete at least 3 exchanges to finish the lab.
When Mozilla's Innovation team ran an internal pilot of Ollama for offline developer tooling in early 2024, their setup guide noted that the most common failure mode wasn't installation — it was verification. Developers would install Ollama, assume it was running, then wonder why their IDE extension wasn't connecting. The fix was always the same: check that the Ollama service was actually started and that port 11434 was listening. This lesson teaches you to install correctly and verify confidently.
Ollama runs on macOS 11 Big Sur or later (Apple Silicon recommended but Intel supported), Linux (x86_64 and ARM64), and Windows 10/11 (via WSL2 or the native preview installer). GPU acceleration requires NVIDIA (CUDA 11.3+) or AMD (ROCm 5.7+) on Linux, or the Apple GPU on macOS via Metal.
Minimum RAM: 8 GB to run 7B models. 16 GB recommended. 32 GB+ for 13B–34B models. CPU-only mode works but is significantly slower — expect 2–5 tokens/second instead of 30–80 tokens/second on a modern GPU.
On macOS, Ollama installs as a menu bar application. When you launch it, the Ollama service starts automatically and places an icon in your menu bar. The CLI is also installed at /usr/local/bin/ollama. Apple Silicon Macs (M1/M2/M3/M4) use Metal for GPU acceleration automatically — no extra configuration needed.
On Linux, the install script detects your NVIDIA or AMD GPU automatically and installs the appropriate CUDA or ROCm libraries. The Ollama service is registered as a systemd unit that starts on boot. For servers without GPUs, the script installs CPU-only mode.
On Windows, Ollama runs natively (no WSL2 required as of v0.1.22). The installer handles path setup. NVIDIA GPU support works via CUDA drivers. AMD GPU support on Windows arrived in Ollama v0.1.29. If you have WSL2 installed, you can also run Ollama inside WSL2 and access it from Windows applications.
Three verification steps, in order:
On Linux, if ollama serve returns "address already in use," another Ollama instance (often the systemd service) is already running. Use sudo systemctl stop ollama before running manually, or simply use the service. Don't run both simultaneously.
After installing, run ollama run llama3.2:1b (a small model good for testing) and look at the output. Ollama prints which device it's using: you'll see something like "running on GPU: NVIDIA GeForce RTX 3080" or "running on CPU". If you expect GPU acceleration but see CPU, check that your GPU drivers are up to date and that Ollama's CUDA/ROCm libraries are correctly installed.
OLLAMA_HOST — Change the bind address (default: 127.0.0.1:11434). Set to 0.0.0.0:11434 to allow network access from other machines. OLLAMA_MODELS — Change where model files are stored (default: ~/.ollama/models). OLLAMA_NUM_PARALLEL — Number of parallel inference requests (default: 1). OLLAMA_MAX_LOADED_MODELS — How many models to keep loaded in RAM simultaneously.
curl http://localhost:11434 verify?ollama serve returns "address already in use" on Linux, what is the most likely cause?ollama serve manually then tries to bind to the same port — fix it by stopping the systemd service first.sudo systemctl stop ollama before running manually.Your AI assistant will present you with Ollama installation scenarios — things that commonly go wrong on real machines. Work through the diagnosis and fix for each scenario. Discuss verification steps, environment variables, and platform-specific considerations.
Complete at least 3 exchanges to finish the lab.
ollama run mistral, and got 'Error: could not connect to ollama app. Is it running?' What are the steps to diagnose and fix this?"When Meta released Llama 3 in April 2024, Ollama had it available as ollama pull llama3 within hours. Within 48 hours, it had been downloaded over 500,000 times from the Ollama registry — more than any previous model launch on the platform. The speed of availability became a defining advantage: model releases that previously required navigating Hugging Face's interface, finding the right GGUF conversion, and manually configuring llama.cpp could now be accessed with a single command.
Visit ollama.com/library to browse available models. Each model page shows available tags (sizes and quantizations), parameter counts, and VRAM requirements. The search is filterable by capability: code, vision, embedding, and so on.
The most commonly used models as of mid-2025:
Every model in the Ollama library has a default tag and optional variants. The default (e.g., llama3.2) points to the recommended quantization for most hardware. Specific tags let you choose different tradeoffs.
8 GB RAM: 7B models at q4_K_M (fits comfortably) or 3B models with room to spare. 16 GB RAM: 13B models at q4_K_M, or 7B models at q8_0. 32 GB RAM: 34B models at q4_K_M. 64 GB+ RAM: 70B models become feasible. Apple Silicon users benefit from unified memory — a MacBook Pro M3 Max with 128 GB can run 70B models entirely in fast RAM.
The first run of any model downloads the GGUF file (2–40 GB depending on size and quantization). After that, it's cached locally. When you run ollama run model, you enter an interactive REPL where you can type prompts. Type /bye to exit, or /help for in-REPL commands.
When benchmarking models, the Ollama team's own documentation suggests using llama3.2:3b for quick tests (fast download, runs on any machine) and mistral:7b-instruct-q4_K_M as a quality baseline for instruction-following tasks. For coding, qwen2.5-coder:7b consistently outperforms general models of the same size on code completion benchmarks as of early 2025.
ollama rm [model-name] removes the model from your local store. This is the only command that deletes model files.ollama rm. Ollama doesn't use delete, uninstall, or prune — rm mirrors the Unix convention for removal.Your AI assistant will present hardware scenarios with specific RAM, GPU, and use-case requirements. For each, work out which model and quantization tag you'd recommend and why. Discuss the tradeoffs between quality, speed, and memory for different Ollama model choices.
Complete at least 3 exchanges to finish the lab.
In mid-2024, the Continue.dev team published a case study describing how a healthcare software company used Ollama's REST API to build an internal code review assistant. The system pulled a custom Modelfile-configured version of CodeLlama with a system prompt tailored to their coding standards, then called it from a VS Code extension via the OpenAI-compatible endpoint. The entire pipeline — from engineer's code editor to inference result — never left the company's network. The key technical detail: using Ollama's Modelfile to bake in the system prompt meant the application code stayed clean and the model's persona could be updated without touching the application.
Ollama exposes two main API styles: its native API (simpler, Ollama-specific) and the OpenAI-compatible API (broader compatibility). Both are served from localhost:11434.
A Modelfile is a plain-text configuration that creates a new named model in your local Ollama registry. Think of it as a lightweight wrapper around an existing model that bakes in a system prompt, default parameters, and optionally a custom stop token format.
By default, the API streams tokens as they're generated. This is great for user-facing applications but requires handling streaming JSON. Set "stream": false to get a single complete response (easier for scripts and testing). For production applications, streaming almost always produces a better user experience.
As of mid-2025, Ollama is natively supported by: LangChain (OllamaLLM and OllamaEmbeddings classes), LlamaIndex (Ollama LLM and embedding integrations), Continue.dev (VS Code/JetBrains IDE extension), Open WebUI (browser-based chat front-end), Enchanted (macOS native app), AnythingLLM (document chat system), and Jan (cross-platform desktop app). In each case, you point the tool at localhost:11434 and it works with any model you've pulled.
Work with your AI assistant to design Modelfiles for specific use cases and plan how to integrate Ollama's API into real applications. Discuss system prompt design, parameter choices, and which integration approach (native API vs OpenAI-compatible) best fits different scenarios.
Complete at least 3 exchanges to finish the lab.
ollama list do?curl -fsSL https://ollama.com/install.sh | sh creates and enables the systemd service automatically. Ollama has no install --service flag, and it's not in standard apt/yum repos.ollama create [name] -f [Modelfile] is the command. After running it, the model appears in ollama list and can be run like any other.ollama create. Ollama doesn't use build, new, or package for this operation — create is the verb, matching Docker's convention.stream: false in an Ollama API request means: