Applied AI Development · Introduction

Code Is Now the Easiest Part

This course exists because the barrier to building with AI collapsed — and most people haven't noticed yet.

In 1965, Gordon Moore observed that the number of transistors on a chip was doubling roughly every year; in 1975 he revised the pace to about every two years. His note in Electronics magazine was not a prophecy. It was a description of something already happening. Within a decade, that observation had reordered the computing industry entirely. Engineers who had been writing assembly instructions for room-sized machines found themselves holding pocket calculators. The skill that had made them valuable, manually managing every clock cycle, became a liability. The new premium was on architecture, judgment, and domain knowledge.

Something structurally similar is happening in software development right now. In March 2023, OpenAI released GPT-4; that June it added function calling to the API. By late 2024, Anthropic's Claude, Google's Gemini, and a cascade of open-weight models (Mistral, Llama 3, Qwen) had made powerful language models accessible through a few lines of Python. GitHub Copilot crossed one million paid subscribers in 2023. Stack Overflow traffic fell 14% year-over-year in 2024 as developers routed questions to AI assistants instead. The bottleneck in software is no longer typing code. It is knowing what to build, understanding what the model is actually doing, and connecting the right tools reliably.

This course teaches you to build real AI-powered applications using Python, modern APIs, and the tooling ecosystem that has emerged around large language models. It is practical by design: every lesson ends in a hands-on lab where you write, test, and reason through actual code. You will leave with working knowledge of Python for AI, the OpenAI and Anthropic APIs, prompt engineering as a craft, vector databases, retrieval-augmented generation, and how to evaluate model outputs honestly. This course will not make AI feel magical. It will make it feel mechanical — and that is far more useful.

If you finish every module, here's who you become:

You'll understand how large language models actually work — not as magic, but as mechanical systems you can reason about and control.
You will write Python code that calls the OpenAI and Anthropic APIs, engineers prompts deliberately, and handles model behavior predictably.
You'll build a retrieval-augmented generation pipeline that connects a real knowledge base to a language model — and know exactly why each component exists.
You will evaluate your AI systems honestly, using structured testing methods that catch failures before users do.
You'll make informed decisions about when to fine-tune a model, when to use RAG, and when neither approach is the right tool.
You will ship a complete AI-powered application to production, with monitoring in place and the architecture to scale it responsibly.
You're becoming the kind of developer who understands what the model is doing, not just that it works — and that distinction is where leverage lives.

Lesson 1 · Python & AI Tooling

Your Development Environment Is Your First Product

A poorly configured workspace creates invisible bugs. Set it up right once.

In January 2023, about two months after ChatGPT launched, Andrej Karpathy (formerly of Tesla, and soon to return to OpenAI) posted a tweet that became widely quoted: "The hottest new programming language is English." The joke landed because it was partly true. But what the joke elided is that English instructions still flow through code, and in AI that code is most often Python. The major AI API providers (OpenAI, Anthropic, Cohere, Mistral, Hugging Face) all ship Python SDKs, and most evaluation harnesses and vector database clients are Python libraries first. The working language of AI infrastructure is Python. If you cannot set up a clean Python environment, install dependencies without breaking things, and structure a project sensibly, you will spend more time debugging your workspace than building anything.

Why Environment Setup Is Not Boring

Most tutorials skip environment setup or treat it as a two-line afterthought. This creates compounding problems. A Python environment can hold only one version of any given package, so two projects that need different versions cannot share one. A library installed globally can silently shadow the version your project needs. API keys stored carelessly in source files get committed to public repositories, a well-documented and expensive mistake. In 2023, GitGuardian reported detecting over 10 million secrets leaked on GitHub, the majority being API keys and tokens.

The professional approach is: one virtual environment per project, dependencies pinned in a requirements.txt or pyproject.toml, secrets loaded from environment variables, and project structure that separates code, configuration, and data from the start.

The Minimal AI Project Stack

You need exactly five things to start building AI applications. Everything else is optional until proven necessary.

Python 3.10+

The runtime. Use 3.10 minimum — structural pattern matching and improved type hints matter for readable AI code.

venv / pyenv

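Virtual environment isolation with venv, plus pyenv if you need to switch between Python versions. One environment per project. Avoid installing AI libraries globally, because dependency conflicts between projects become likely.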

pip + requirements.txt

Dependency management. Pin exact versions for reproducibility. Use pip freeze > requirements.txt after stabilizing.

python-dotenv

Loads API keys from a .env file into environment variables. The .env file never gets committed. Simple and effective.

VS Code or JupyterLab

Your editor. VS Code with the Python extension for scripts and modules. JupyterLab for exploration and prototyping.

Project Structure That Scales

Starting with good structure costs five minutes and saves hours. Here is the layout used throughout this course:

Recommended project layout

my-ai-project/
├── .env                  # API keys; never commit this
├── .gitignore            # includes .env, __pycache__, .venv
├── requirements.txt      # pinned dependencies
├── README.md
├── src/
│   ├── __init__.py
│   ├── main.py           # entry point
│   ├── config.py         # loads env vars, constants
│   └── utils.py          # shared helpers
├── notebooks/            # Jupyter notebooks for exploration
└── data/                 # local data files (gitignored if sensitive)

Setting Up Your First AI Environment

Here is the exact sequence. Run these commands in your terminal; the same steps are a common way for teams to bootstrap a new project:

Terminal — environment creation

# Create project directory
mkdir my-ai-project && cd my-ai-project

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate        # Mac/Linux
# .venv\Scripts\activate         # Windows

# Install core AI libraries
pip install openai anthropic python-dotenv

# Pin your dependencies
pip freeze > requirements.txt

src/config.py — safe API key loading

import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from project root

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")

if not OPENAI_API_KEY:
    raise ValueError("OPENAI_API_KEY not set in .env")
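When a project needs several keys, the same check can be generalised so that every missing variable is reported at once. The following is a minimal sketch, assuming a hypothetical helper named require_env (it is not part of python-dotenv). It reads only from os.environ, so load_dotenv() would still need to run first if the keys live in .env.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the named environment variables, or raise listing every missing one.

    Hypothetical helper for illustration; not part of python-dotenv.
    """
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

A call such as require_env("OPENAI_API_KEY", "ANTHROPIC_API_KEY") at startup fails early with one message instead of failing later on the first API call.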

Your First API Call

Once your environment is configured, making an API call is genuinely simple. Here is the full working code to call OpenAI's gpt-4o-mini — a fast, cheap model ideal for development:

src/main.py — first working API call

from openai import OpenAI

from config import OPENAI_API_KEY

client = OpenAI(api_key=OPENAI_API_KEY)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is a virtual environment in Python?"}
    ],
    max_tokens=300
)

print(response.choices[0].message.content)

Cost Awareness

gpt-4o-mini costs $0.15 per million input tokens and $0.60 per million output tokens as of mid-2025. A 300-token response costs less than $0.0002. Use it freely during development. Switch to gpt-4o or Claude Sonnet only when capability requires it.
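The arithmetic behind that estimate is easy to script. This is a small sketch that hard-codes the two prices quoted above as assumptions; estimate_cost is a hypothetical name, and the prices should be checked against current pricing before relying on the result.

```python
# Prices in USD per million tokens, copied from the figures quoted above.
# They change over time; treat them as assumptions, not constants of nature.
INPUT_PRICE_PER_MTOK = 0.15
OUTPUT_PRICE_PER_MTOK = 0.60


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one gpt-4o-mini call at the prices above."""
    return (
        input_tokens * INPUT_PRICE_PER_MTOK
        + output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


# 300 output tokens alone: 300 * 0.60 / 1,000,000 = 0.00018 USD
cost_of_300_output_tokens = estimate_cost(0, 300)
```

The input side of the bill is computed the same way, so a long prompt can cost more than a short reply even though input tokens are cheaper per token.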

Virtual Environment: An isolated Python installation with its own packages, separate from system Python and other projects.

.env file: A plain text file containing key=value pairs loaded as environment variables. Never committed to version control.

SDK: Software Development Kit. The official Python library (e.g., openai, anthropic) that wraps API calls in convenient Python functions.

Chat Completions API: The primary OpenAI endpoint. Accepts an array of messages with roles (system, user, assistant) and returns a generated reply.

Lesson 1 Check

Python & AI Tooling · 4 questions

What is the primary reason to use a virtual environment for AI projects?

Correct. Virtual environments create isolated Python installations — each project gets its own packages at its own versions, preventing the "it works on my machine" dependency conflicts that are extremely common in the AI library ecosystem.

Not quite. Virtual environments don't affect performance or encryption. Their purpose is dependency isolation — ensuring that installing a new library for one project doesn't break another.

Where should API keys like OPENAI_API_KEY be stored in a well-structured project?

Correct. The .env + python-dotenv pattern keeps secrets out of source code and version control. GitGuardian detected over 10 million leaked secrets on GitHub in 2023 — the vast majority were API keys committed by developers who didn't use this pattern.

Storing keys in source code or committed config files is one of the most common and costly security mistakes in software development. The .env file, excluded via .gitignore, is the right approach.

Which Python version is the minimum recommended for AI development in this course, and why?

Correct. Python 3.10 introduced structural pattern matching (match/case), improved union type syntax, and better error messages — all useful for readable AI application code. Python 3.12 is also fine; the key is avoiding anything below 3.10.

Python 3.10 is the recommended minimum. It introduced structural pattern matching, better union type hints (X | Y instead of Union[X, Y]), and clearer error messages — features that appear throughout modern AI application code.

In the OpenAI Chat Completions API, what does the "system" role message do?

Correct. The system message is the developer's primary control surface — it sets the model's persona, constraints, output format preferences, and behavioral rules. It persists across the conversation and is typically processed before user messages.

The system message provides persistent instructions to the model — its persona, constraints, and behavioral rules. Authentication is handled by the API key passed to the client constructor, not by message roles.

Lab 1 · Environment Setup & First API Call

Hands-on: configure your workspace, load keys safely, make your first call

What You're Doing

In this lab you'll work through setting up a Python AI project from scratch — virtual environments, dependency management, .env configuration, and making your first API call. Your AI lab assistant will guide you step by step and answer questions about any part of the process.

Work through these objectives in conversation. You can ask for clarification, request code examples, or ask the assistant to explain why each step matters.

Start by telling the assistant your operating system (Mac, Windows, or Linux) and whether you have Python installed. Then ask it to walk you through the full environment setup for this project.


Lesson 2 · Python & AI Tooling

The OpenAI & Anthropic APIs in Depth

Understanding what the API actually does is the difference between using AI and building with it.

When OpenAI released the GPT-3 API in 2020, only developers on a waitlist could access it. By the time GPT-4 launched in March 2023, access was open, the Python SDK had a clean interface, and the documentation was thorough enough that a competent developer could go from zero to a working application in an afternoon. What had changed wasn't just capability — it was the design of the API itself. The chat completions format, the structured message array with roles, the parameters for controlling randomness and length: these were design decisions that made the API predictable enough to build on seriously. Understanding those parameters isn't optional. Temperature, max_tokens, stop sequences — each has a direct effect on what your application does.

The Chat Completions Request Object

Every call to the OpenAI Chat Completions API sends a JSON object and receives a JSON object. The Python SDK handles the serialization, but you should understand the underlying structure because it's what you're actually controlling.

Full request with key parameters explained

response = client.chat.completions.create(
    model="gpt-4o-mini",          # which model
    messages=[...],               # conversation history (placeholder)
    temperature=0.7,              # randomness 0.0–2.0
    max_tokens=512,               # max output length
    top_p=1.0,                    # nucleus sampling threshold
    frequency_penalty=0.0,        # penalize tokens in proportion to how often they have appeared
    presence_penalty=0.0,         # penalize any token that has already appeared
    stop=["###"],                 # stop generation when this sequence appears
    n=1,                          # number of completions to generate
    stream=False                  # stream tokens as generated
)
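The response object is worth inspecting as well as the request. Each choice carries a finish_reason field, and the value "length" means generation stopped because it hit max_tokens. The sketch below uses stand-in objects shaped like the SDK response so it runs without a network call; was_truncated is a hypothetical helper name.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """True when the first choice stopped because it hit the max_tokens limit."""
    return response.choices[0].finish_reason == "length"


# Stand-in objects shaped like the SDK response, so the sketch runs offline.
complete = SimpleNamespace(choices=[SimpleNamespace(finish_reason="stop")])
cut_off = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])

print(was_truncated(complete), was_truncated(cut_off))
```

In an application, a truncated reply might be retried with a larger max_tokens or flagged to the user rather than passed on silently.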

Parameters That Actually Matter

Most parameters you will leave at their defaults. These three you will actively tune for every application:

temperature: Controls randomness. 0.0 = near-deterministic (the model almost always picks the highest-probability token, though identical output across runs is not guaranteed). 1.0 = default creative variance. Above 1.2 = often incoherent. For factual extraction tasks, use 0.0–0.2. For creative writing, 0.7–1.0.

max_tokens: Hard ceiling on output length. One token ≈ 0.75 English words. A 512-token response is roughly 380 words. Setting this too low truncates responses mid-sentence.

stream: When True, the API returns tokens as they're generated (like ChatGPT's typing effect). Essential for user-facing applications where waiting 3–5 seconds for a response feels broken.
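To see why a low temperature concentrates probability on the top token, it helps to look at the standard temperature-scaled softmax. The sketch below uses made-up logits for three candidate tokens. It illustrates the textbook sampling math and is not a description of any provider's internal implementation.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; T=0 is the argmax limit")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)  # top token gets almost all the mass
warm = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)   # mass spreads across all three

for name, probs in (("T=0.2", cold), ("T=1.0", warm), ("T=2.0", hot)):
    print(name, [round(p, 3) for p in probs])
```

As the temperature falls, the distribution sharpens toward the single most likely token; as it rises, lower-ranked tokens are sampled more often.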

Streaming Responses

Streaming is not optional for production user-facing applications. A 500-token response at default speed takes 3–6 seconds to complete. With streaming, the user sees the first token in under a second. Here is the correct pattern:

Streaming with OpenAI SDK

with client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain RAG in 3 sentences."}],
    max_tokens=150,
    stream=True
) as stream:
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)

print()  # newline after stream ends
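Applications usually need the complete text after the stream finishes, for logging or for adding to conversation history. A minimal sketch of that accumulation step follows. collect_stream and fake_chunk are hypothetical names, and the stand-in chunks only mimic the shape of streamed chunks so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join the text deltas from a chat-completions stream into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks carry no text (content is None)
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in shaped like a streamed chunk, for offline testing."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


demo = [
    fake_chunk("RAG "),
    fake_chunk(None),
    fake_chunk("retrieves."),
    SimpleNamespace(choices=[]),
]
full_text = collect_stream(demo)
```

The same loop can print each delta as it arrives and still return the joined text at the end.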

The Anthropic API — Same Pattern, Different SDK

The Anthropic SDK follows a nearly identical pattern to OpenAI's. The key structural difference is that Anthropic separates the system prompt from the messages array — it's a top-level parameter, not a message with role "system". Claude models also have a distinct context window: Claude 3.5 Sonnet supports 200,000 tokens of context, versus GPT-4o's 128,000.

Equivalent call with Anthropic SDK

import anthropic

from config import ANTHROPIC_API_KEY

client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

message = client.messages.create(
    model="claude-3-5-haiku-20241022",  # fast, cheap
    max_tokens=512,
    system="You are a Python expert. Be concise.",
    messages=[
        {"role": "user", "content": "What is a generator in Python?"}
    ]
)

print(message.content[0].text)
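Because the system prompt lives in a different place, code that supports both providers needs a small translation step. The sketch below shows one way to do it, assuming plain string message content. to_anthropic_kwargs is a hypothetical helper, not part of either SDK.

```python
def to_anthropic_kwargs(openai_messages: list[dict]) -> dict:
    """Split OpenAI-style messages into Anthropic's system= and messages= arguments.

    Hypothetical helper; handles plain string content only.
    """
    system_parts = [m["content"] for m in openai_messages if m["role"] == "system"]
    chat = [m for m in openai_messages if m["role"] != "system"]
    kwargs = {"messages": chat}
    if system_parts:
        kwargs["system"] = "\n\n".join(system_parts)
    return kwargs


converted = to_anthropic_kwargs([
    {"role": "system", "content": "You are a Python expert. Be concise."},
    {"role": "user", "content": "What is a generator in Python?"},
])
```

The result can then be unpacked into the call, for example client.messages.create(model=..., max_tokens=512, **converted).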

API Comparison

Feature	OpenAI (gpt-4o-mini)	Anthropic (Claude Haiku)
System prompt	Message with role="system"	Top-level system= parameter
Context window	128,000 tokens	200,000 tokens
Response object	response.choices[0].message.content	message.content[0].text
Streaming	stream=True, iterate chunks	stream=True, or client.messages.stream() with text_stream
Price (input/output)	$0.15/$0.60 per MTok	$0.80/$4.00 per MTok (Haiku 3.5)

Which API to Use?

Use OpenAI gpt-4o-mini for development: at the prices in the table above, it is the cheaper of the two. For production, benchmark both on your actual task. Claude models often perform better on long-document tasks and structured extraction. GPT-4o performs well on code generation and tool use. The right answer is always empirical, not tribal loyalty.

Lesson 2 Check

OpenAI & Anthropic APIs · 4 questions

You are building a document classification system that must produce consistent, reproducible results. What temperature setting should you use?

Correct. Temperature 0.0 makes the model as close to deterministic as the API allows: it picks the highest-probability token at each step, so the same input produces the same output in nearly every run. This is critical for classification, extraction, and any task where reproducibility matters.

For classification tasks, you want temperature 0.0, which gives near-deterministic output. Higher temperatures introduce randomness that would cause the same document to be classified differently on repeated runs. Temperature is not about accuracy; it's about variance.

What is the key structural difference between how OpenAI and Anthropic handle system prompts?

Correct. OpenAI includes system instructions as the first message in the messages array with role="system". Anthropic extracts the system prompt to a top-level system= parameter separate from the messages array. This structural difference means you can't swap SDKs without modifying how you pass system instructions.

The key difference: OpenAI puts system instructions as a message with role="system" inside the messages array. Anthropic takes system as a separate top-level parameter outside messages. This is a real structural difference — not just syntax.

Why is streaming (stream=True) important for user-facing AI applications?

Correct. Without streaming, users stare at a blank interface while the model generates a full response — typically 3–6 seconds. With streaming, they see the first token in under a second, which dramatically improves perceived performance and reduces the experience of waiting.

Streaming is about perceived latency, not cost or quality. Without it, users wait 3–6 seconds for nothing to happen, then see the entire response appear at once. With streaming, they see text appearing immediately — the same experience as ChatGPT's typing effect.

You set max_tokens=50 for a response that requires about 80 tokens to complete naturally. What happens?

Correct. max_tokens is a hard cutoff. The model generates tokens until it hits the limit and stops — regardless of whether the thought, sentence, or response is complete. Always set max_tokens generously enough for your expected output length.

max_tokens is a hard ceiling. When reached, generation stops immediately — mid-sentence, mid-word if needed. The model doesn't compress or wrap up gracefully. Always set this value higher than your expected output length.

Lab 2 · API Parameters & Streaming

Hands-on: experiment with temperature, max_tokens, and streaming responses

What You're Doing

You'll work through experimenting with the key API parameters from Lesson 2 — temperature, max_tokens, stop sequences, and streaming. The lab assistant will guide you through concrete experiments and help you understand what each parameter change actually produces.

Start by asking the assistant to help you write a Python script that calls gpt-4o-mini twice with the same prompt — once at temperature 0.0 and once at temperature 1.0 — and prints both responses side by side so you can compare the difference.


Lesson 3 · Python & AI Tooling

Prompt Engineering as Engineering

Prompts are code. They have structure, edge cases, and failure modes — treat them accordingly.

In September 2023, a group of researchers at DeepMind published a paper titled "Large Language Models as Optimizers" showing that asking a model to improve its own prompt — using the model to do prompt engineering — outperformed human-written prompts on several benchmarks. The paper wasn't evidence that prompt engineering was trivial; it was evidence that it was difficult enough that automation was worth pursuing. What this field calls "prompt engineering" is not the same as writing better sentences. It is a systematic craft: specifying behavior precisely, constraining output format, providing examples that constrain the inference space, and testing against failure modes. The developers who treat prompts as engineering artifacts — versioned, tested, iterated — produce more reliable AI applications than those who treat them as magic incantations.

The Structure of an Effective Prompt

A well-engineered prompt draws on up to four components. Not every prompt needs all four, but knowing which to include and why is the skill.

1. Role / Persona: Tell the model what kind of expert it is. "You are a senior Python developer reviewing code for production readiness." This shapes vocabulary, assumptions, and rigor.

2. Context / Background: Provide the information the model needs that it can't infer. The nature of your system, constraints that apply, what has already been tried.

3. Task / Instruction: State the task precisely. "Classify this customer email into one of: billing, technical support, feature request, or other. Return only the category name."

4. Output Format: Specify exactly what you want back. JSON with specific keys, a numbered list, a single word, a Python dict literal. The more constrained the output format, the easier to parse programmatically.
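The four components can be assembled in code so that each one stays separately editable. This is a rough sketch of one possible approach. build_prompt and its section labels are hypothetical choices for illustration, not a standard format.

```python
def build_prompt(task: str, role: str = "", context: str = "", output_format: str = "") -> str:
    """Join whichever of the four components are provided, in a fixed order."""
    sections = [
        ("Role", role),
        ("Context", context),
        ("Task", task),
        ("Output format", output_format),
    ]
    # Skip empty components so a prompt only contains what it needs.
    return "\n\n".join(
        f"{label}: {text.strip()}" for label, text in sections if text.strip()
    )


prompt = build_prompt(
    role="You are a senior Python developer reviewing code for production readiness.",
    task="List the three most serious problems in the function below.",
    output_format="A numbered list, one sentence per item.",
)
print(prompt)
```

Here the context component is omitted, which matches the point that not every prompt needs all four.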

Few-Shot Examples

Few-shot prompting (including 2–5 examples of the input/output pattern you want) is one of the most reliable ways to improve model performance on structured tasks. Examples constrain the inference space more precisely than verbal instructions alone.
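Examples can be written inline in the system prompt, or supplied as alternating user and assistant turns ahead of the real input. The sketch below shows the second style. few_shot_messages is a hypothetical helper, and the sentiment examples are invented purely for illustration.

```python
def few_shot_messages(
    instruction: str, examples: list[tuple[str, str]], query: str
) -> list[dict]:
    """Build a chat message list: system instruction, example turns, then the real input."""
    messages = [{"role": "system", "content": instruction}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages


# Hypothetical sentiment examples, invented for this sketch.
EXAMPLES = [
    ("The checkout page keeps crashing", "negative"),
    ("Setup took two minutes, great docs", "positive"),
]
messages = few_shot_messages(
    "Label the sentiment as positive or negative.",
    EXAMPLES,
    "Support never replied",
)
```

Which style works better is something to measure on your own task; both give the model concrete input/output pairs to imitate.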

Few-shot prompt for email classification

SYSTEM_PROMPT = """You are an email classifier for a SaaS company.
Classify each email into exactly one category.
Categories: billing | technical | feature_request | other
Return only the category name in lowercase.

Examples:

Email: "My invoice shows the wrong amount"
Category: billing

Email: "The API returns 500 errors on /v2/export"
Category: technical

Email: "Can you add dark mode to the dashboard?"
Category: feature_request

Email: "Great product, love using it"
Category: other
"""

Chain-of-Thought Prompting

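For complex reasoning tasks, adding an instruction such as "Think step by step before giving your final answer" measurably improves accuracy. Google's 2022 paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Wei et al.) showed the effect across arithmetic, symbolic reasoning, and commonsense tasks by including worked, step-by-step examples in the prompt. A second 2022 paper, "Large Language Models are Zero-Shot Reasoners" (Kojima et al.), showed that simply appending "Let's think step by step" helps even without examples. Both work by encouraging the model to generate intermediate steps that serve as working memory.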

Structured Output: Forcing JSON

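Parsing free-text responses programmatically is fragile. OpenAI's response_format parameter (available on gpt-4o and gpt-4o-mini) switches on JSON mode, which makes the model return syntactically valid JSON as long as the response is not cut off by max_tokens. JSON mode requires the word "JSON" to appear somewhere in your messages, and it does not enforce a schema, so the prompt must spell out the keys you expect. Combined with a precise prompt specifying the schema, this is reliable enough for production use: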

Forcing JSON output with schema specification

import json

# assumes `client` is the OpenAI client created in Lesson 1
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # JSON mode: valid JSON, no schema check
    messages=[
        {"role": "system", "content": """Extract entities from text.
Return JSON with keys: people (list), orgs (list), dates (list)."""},
        {"role": "user", "content": "On March 15, Sam Altman met with Satya Nadella at Microsoft HQ."}
    ],
    temperature=0.0
)

data = json.loads(response.choices[0].message.content)

# Typical output shown in the comments; exact values can vary between runs.
print(data["people"])   # e.g. ['Sam Altman', 'Satya Nadella']
print(data["orgs"])     # e.g. ['Microsoft']
print(data["dates"])    # e.g. ['March 15']
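Valid JSON is not the same as JSON with the right keys, so a small check between json.loads and the rest of the application is cheap insurance. The sketch below validates the three list fields used in the prompt above. parse_entities is a hypothetical helper and the sample string is hand-written.

```python
import json

EXPECTED_KEYS = ("people", "orgs", "dates")


def parse_entities(raw: str) -> dict[str, list]:
    """Parse a JSON reply and check it has the three expected list fields."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed or truncated JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in EXPECTED_KEYS:
        if not isinstance(data.get(key), list):
            raise ValueError(f"missing or non-list field: {key}")
    # Return only the expected keys; anything extra is dropped.
    return {key: data[key] for key in EXPECTED_KEYS}


# Hand-written sample reply, not real model output.
sample = '{"people": ["Sam Altman"], "orgs": ["Microsoft"], "dates": ["March 15"], "note": "x"}'
entities = parse_entities(sample)
```

Since json.JSONDecodeError is a subclass of ValueError, a caller can catch ValueError once to handle both malformed JSON and a wrong shape.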

Prompt Versioning and Testing

A prompt is a program. Version it. When you change a prompt, you are changing application behavior — you need to know that the new version is better than the old one, and on which inputs it might be worse. The minimum viable prompt testing harness:

1. Maintain a test set: 20–50 examples with known correct outputs. Start collecting these from day one.

2. Store prompts in files: prompts/v1/system.txt, prompts/v2/system.txt. Never hardcode prompts in application logic.

3. Run before/after comparisons: when you change a prompt, run your test set against both versions and compare accuracy or output quality.

4. Log production failures: when the model produces unexpected output, add that input to your test set immediately.
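The before/after comparison in step 3 can be a few lines of code. The sketch below scores any classify function against a labelled test set. The two classifiers here are offline stand-ins for "prompt v1" and "prompt v2"; in a real harness each would call the model with the corresponding prompt file. All names are hypothetical.

```python
from typing import Callable

# Tiny stand-in test set; a real one would hold 20-50 labelled examples.
TEST_SET = [
    ("My invoice shows the wrong amount", "billing"),
    ("The API returns 500 errors on /v2/export", "technical"),
    ("Can you add dark mode to the dashboard?", "feature_request"),
]


def accuracy(classify: Callable[[str], str], test_set=TEST_SET) -> float:
    """Fraction of test inputs for which classify() returns the expected label."""
    correct = sum(1 for text, expected in test_set if classify(text) == expected)
    return correct / len(test_set)


def misses(classify: Callable[[str], str], test_set=TEST_SET) -> list[tuple[str, str, str]]:
    """(input, expected, actual) for every wrong answer, ready to inspect or log."""
    return [
        (text, expected, classify(text))
        for text, expected in test_set
        if classify(text) != expected
    ]


def classify_v1(text: str) -> str:
    """Stand-in for an early prompt version that labels everything as billing."""
    return "billing"


def classify_v2(text: str) -> str:
    """Stand-in for an improved prompt version (simple keyword rules)."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "billing"
    if "error" in lowered:
        return "technical"
    if "add" in lowered:
        return "feature_request"
    return "other"


print(f"v1: {accuracy(classify_v1):.2f}  v2: {accuracy(classify_v2):.2f}")
```

Looking at misses() for both versions matters as much as the headline number, because a new prompt can raise overall accuracy while breaking inputs the old one handled.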

Zero-shot: Prompting without examples. Works for simple, well-defined tasks. Unreliable for tasks with subtle output requirements.

Few-shot: Prompting with 2–5 input/output examples. Substantially improves performance on structured tasks by constraining the output distribution.

Chain-of-Thought: Instructing the model to reason step-by-step before answering. Measurably improves multi-step reasoning tasks.

response_format: OpenAI parameter that constrains output to valid JSON. Greatly reduces parsing failures from free-text responses when output structure is critical.

Lesson 3 Check

Prompt Engineering · 4 questions

You need to build a prompt that extracts structured data from unstructured text and returns it as parseable JSON. Which two techniques should you combine?

Correct. response_format={"type":"json_object"} forces valid JSON output. Specifying the exact schema (keys, types, structure) in the system prompt at temperature 0.0 ensures consistent, parseable output. This combination is production-reliable.

For reliable structured data extraction, you need response_format=json_object to guarantee valid JSON, a precise schema in the system prompt to specify the structure, and temperature 0.0 for consistency. Relying on the model to infer format from examples alone is fragile in production.

What does few-shot prompting accomplish that detailed verbal instructions alone cannot?

Correct. There is a gap between what we can describe in words and what we can demonstrate through examples — the same gap that exists in teaching humans a skill. Examples constrain the model's probability distribution over outputs in ways that verbal descriptions cannot fully replicate.

Few-shot examples work because they show rather than tell. A verbal description of what output format you want leaves more room for interpretation than showing 3–4 concrete input/output pairs. The examples narrow the space of plausible outputs more precisely than instructions alone.

Why should prompts be stored in separate files rather than hardcoded in application logic?

Correct. Prompts are software artifacts with their own change history. Storing them in files enables version control, before/after comparison testing, and updating prompt behavior without redeploying application code. Hardcoded prompts make iteration and testing significantly harder.

Prompts are a distinct layer of your application — effectively configuration for model behavior. Like any configuration, they should be versioned, testable, and updatable independently of application code. Hardcoding them conflates two things that change at different rates.

Chain-of-thought prompting works primarily because:

Correct. Language models generate text left to right, conditioning on everything already generated. By generating reasoning steps first, those steps become context the model can reference when producing the final answer — functioning as a form of scratchpad or working memory. This is the mechanistic reason chain-of-thought works.

Chain-of-thought works because language models are autoregressive — each token is conditioned on all previous tokens. When the model writes out intermediate steps, those steps become context it can build on, effectively extending its working memory beyond what can be computed in a single token prediction.

Lab 3 · Prompt Engineering Workshop

Hands-on: build, test, and iterate on structured prompts

What You're Doing

You'll build a complete prompt engineering workflow for a real task: classifying customer support emails into structured categories and extracting key information as JSON. You'll write the prompt, test it against failure cases, and iterate to handle edge cases.

Pick a domain you know well (e-commerce, SaaS support, HR tickets, etc.) and tell the assistant. Ask it to help you build a few-shot classification prompt with JSON output for that domain — then test it against edge cases together.


Lesson 4 · Python & AI Tooling

Managing Conversations, Errors, and Cost

A working API call is the beginning. A production application handles everything that can go wrong.

In January 2024, a developer posted on Reddit that their startup had received an unexpected OpenAI bill of $14,000 for a single weekend. A bug in their conversation management code had allowed chat histories to grow unbounded — every new user message included the entire session history, which could run to 50,000 tokens per request. Multiplied by a viral launch that sent thousands of concurrent users into hour-long sessions, the cost was catastrophic. The root cause was not using the API incorrectly — the API was doing exactly what it was told. The problem was that nobody had thought through conversation memory management, token counting, or cost controls as engineering concerns. They are.

Conversation Memory: The Core Problem

Language model APIs are stateless. Every API call is independent. There is no persistent memory between calls. The illusion of conversation — of the model "remembering" what was said earlier — is created entirely by including previous messages in the messages array on each new call. This means conversation history grows with every exchange, and every token in the history is billed. Managing this growth is mandatory for any multi-turn application.
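A quick calculation shows how fast the bill grows when the full history is resent on every turn. The sketch below assumes every user and assistant message has the same length, which real conversations will not; the point is the growth rate, not the exact number. cumulative_input_tokens is a hypothetical name.

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int, system_tokens: int = 0) -> int:
    """Total input tokens billed over a conversation when full history is resent.

    Assumes uniform message length; a simplification for illustration.
    """
    total = 0
    history = 0  # tokens of all user + assistant messages so far
    for _ in range(turns):
        history += tokens_per_message      # the new user message
        total += system_tokens + history   # the whole history is sent as input
        history += tokens_per_message      # the assistant reply joins the history
    return total


ten_turns = cumulative_input_tokens(10, 100)       # 100 * 10**2
hundred_turns = cumulative_input_tokens(100, 100)  # 100 * 100**2
```

Under these assumptions the total is tokens_per_message times the square of the number of turns: ten times more turns costs a hundred times more input tokens.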

Conversation manager with sliding window

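class ConversationManager:
    """Keeps a sliding window of recent messages for a stateless chat API."""

    def __init__(self, system_prompt, max_history=20):
        self.system_prompt = system_prompt
        self.max_history = max_history  # max messages to retain (system prompt excluded)
        self.history = []

    def _trim(self):
        # Drop the oldest messages once the window is exceeded.
        while len(self.history) > self.max_history:
            self.history.pop(0)
        # Preserve pairs: never let the window start with an orphaned assistant reply.
        while self.history and self.history[0]["role"] != "user":
            self.history.pop(0)

    def add_user(self, text):
        self.history.append({"role": "user", "content": text})
        self._trim()

    def add_assistant(self, text):
        self.history.append({"role": "assistant", "content": text})
        self._trim()

    def get_messages(self):
        # The system prompt is re-sent on every call and is never trimmed.
        return [{"role": "system", "content": self.system_prompt}] + self.history

    def chat(self, client, user_input):
        self.add_user(user_input)
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=self.get_messages(),
                max_tokens=512,
            )
        except Exception:
            # Roll back so a failed call does not leave an unanswered user message behind.
            if self.history and self.history[-1]["role"] == "user":
                self.history.pop()
            raise
        reply = response.choices[0].message.content
        self.add_assistant(reply)
        return reply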
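
A message-count window bounds the number of messages but not their size; a few very long messages can still produce an expensive request. One alternative is to trim by an estimated token budget instead. The sketch below is a minimal illustration under stated assumptions: `trim_to_budget` and `rough_tokens` are hypothetical names, and `rough_tokens` is a crude characters-divided-by-four heuristic standing in for a real tokenizer.

```python
def rough_tokens(message):
    # Hypothetical heuristic: about 4 characters per token, plus a small
    # per-message overhead. Swap in a real tokenizer for production use.
    return len(message["content"]) // 4 + 4


def trim_to_budget(messages, max_tokens, count=rough_tokens):
    """Return a copy of messages whose estimated size fits within max_tokens.

    System messages are always kept; the oldest other messages are dropped first.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(count(m) for m in system)
    kept = []
    # Walk backwards so the newest messages are the ones that survive.
    for m in reversed(rest):
        cost = count(m)
        if used + cost > max_tokens:
            break
        kept.append(m)
        used += cost
    return system + kept[::-1]


conversation = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_to_budget(conversation, max_tokens=40)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```

Note that this version can leave an assistant reply at the start of the window. Whether that matters depends on the application; the pair-preserving rule from the message-count approach could be applied on top.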

Error Handling and Retry Logic

AI APIs fail. Rate limits, transient network errors, and occasional 500s are facts of production life. Applications that don't handle these gracefully will crash at the worst possible moments. The pattern is: catch specific exceptions, retry with exponential backoff for transient errors, and fail fast for permanent errors.

Robust API call with retry logic

import time

import openai

def call_with_retry(client, messages, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                max_tokens=512,
            )
        except openai.BadRequestError:
            raise  # permanent error: don't retry, fix the request
        except (openai.RateLimitError, openai.APIConnectionError,
                openai.InternalServerError) as e:
            # Transient errors: rate limits, network failures, and 5xx responses.
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error to the caller
            wait = base_delay * (2 ** attempt)  # 1s, 2s, 4s
            print(f"{type(e).__name__}. Waiting {wait}s...")
            time.sleep(wait)
    raise RuntimeError("max_retries must be at least 1")
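
The retry helper above waits on a fixed schedule. If many clients hit a rate limit at the same moment, they will also retry at the same moment. A common refinement is to randomize each wait, usually called jitter. The sketch below shows one variant ("full jitter"); `backoff_delay` and its parameters are illustrative names, not part of the openai library.

```python
import random


def backoff_delay(attempt, base_delay=1.0, max_delay=30.0, rng=random):
    """Return a randomized wait in seconds for the given zero-based attempt.

    The ceiling doubles with each attempt and is capped at max_delay;
    the actual wait is drawn uniformly between 0 and that ceiling.
    """
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    return rng.uniform(0, ceiling)


for attempt in range(4):
    print(f"attempt {attempt}: wait up to {min(30.0, 2 ** attempt)}s, "
          f"drew {backoff_delay(attempt):.2f}s")
```

In a retry loop, the result of `backoff_delay(attempt)` would replace the fixed `base_delay * (2 ** attempt)` passed to `time.sleep`.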

Token Counting and Cost Control

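Use the tiktoken library, OpenAI's official tokenizer, to count tokens before sending requests. For chat requests the result is a close estimate rather than an exact figure, because each message carries a few tokens of formatting overhead that varies by model. It is still accurate enough to enforce cost limits and prevent the history growth bug described in the opening scene: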

Token counting with tiktoken

import tiktoken

def count_tokens(messages, model="gpt-4o-mini"):
    # Requires a tiktoken release that recognizes the model name.
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        total += 4  # approximate per-message overhead; the exact value varies by model
        # content can be None (for example on tool-call messages), so fall back to "".
        total += len(enc.encode(msg.get("content") or ""))
    return total

MAX_INPUT_TOKENS = 3000  # hard limit

def safe_chat(client, messages):
    tokens = count_tokens(messages)
    if tokens > MAX_INPUT_TOKENS:
        raise ValueError(f"Input too long: {tokens} tokens")
    return client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, max_tokens=512
    )
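
A token count can also be turned into a cost ceiling before the request is sent. The following is a minimal sketch, assuming the gpt-4o-mini prices used in this lesson's logging example (0.15 and 0.60 USD per million input and output tokens); `worst_case_cost` is a hypothetical helper, and the prices should be checked against the provider's current price list.

```python
def worst_case_cost(input_tokens, max_output_tokens,
                    input_price_per_m=0.15, output_price_per_m=0.60):
    """Upper bound in USD for one request: all input tokens plus a full-length reply."""
    return (input_tokens * input_price_per_m
            + max_output_tokens * output_price_per_m) / 1_000_000


# A request at the 3,000-token limit versus the 50,000-token runaway history.
print(f"${worst_case_cost(3_000, 512):.6f}")    # $0.000757
print(f"${worst_case_cost(50_000, 512):.6f}")   # $0.007807
```

The per-request numbers look small, which is the point: the damage comes from multiplying them by every turn of every concurrent session.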

Logging and Observability

Without logging, AI application failures are nearly impossible to diagnose. Every production AI application should log at minimum: the full messages array sent, the model response, token counts, latency, and any errors. The response object itself contains usage data:

Logging token usage from response object

response = client.chat.completions.create(...)
usage = response.usage
print(f"Input tokens:  {usage.prompt_tokens}")
print(f"Output tokens: {usage.completion_tokens}")
print(f"Total tokens:  {usage.total_tokens}")

# Cost estimate for gpt-4o-mini (USD per million tokens; check current pricing)
input_cost = usage.prompt_tokens * 0.15 / 1_000_000
output_cost = usage.completion_tokens * 0.60 / 1_000_000
print(f"Estimated cost: ${input_cost + output_cost:.6f}")
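
Printing is fine for a demo, but production code usually wants one structured log record per request and a running total. The sketch below is one possible shape, not a prescribed design: `UsageTracker`, its field names, and the budget threshold are all hypothetical, and a `SimpleNamespace` stands in for `response.usage` so the example runs without an API call.

```python
import json
import logging
from types import SimpleNamespace

logger = logging.getLogger("llm_usage")


class UsageTracker:
    """Accumulates estimated spend and emits one JSON log line per request."""

    def __init__(self, input_price_per_m=0.15, output_price_per_m=0.60, budget_usd=5.00):
        self.input_price_per_m = input_price_per_m
        self.output_price_per_m = output_price_per_m
        self.budget_usd = budget_usd
        self.total_usd = 0.0

    @property
    def over_budget(self):
        return self.total_usd > self.budget_usd

    def record(self, usage, latency_s, model="gpt-4o-mini"):
        cost = (usage.prompt_tokens * self.input_price_per_m
                + usage.completion_tokens * self.output_price_per_m) / 1_000_000
        self.total_usd += cost
        logger.info(json.dumps({
            "model": model,
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
            "latency_s": round(latency_s, 3),
            "cost_usd": round(cost, 6),
            "total_usd": round(self.total_usd, 6),
        }))
        if self.over_budget:
            logger.warning("LLM budget exceeded: $%.4f of $%.2f",
                           self.total_usd, self.budget_usd)
        return cost


# Stand-in for response.usage (assumed to expose the same two attributes).
fake_usage = SimpleNamespace(prompt_tokens=1200, completion_tokens=300)
tracker = UsageTracker(budget_usd=1.00)
cost = tracker.record(fake_usage, latency_s=0.84)
print(f"{cost:.6f}")  # 0.000360
```

An in-process total like this resets on restart, so it complements a cost alert in the provider's dashboard and does not replace one.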

Production Checklist

Before shipping any AI feature:
① Conversation history has a token or message count limit.
② All API calls have retry logic with exponential backoff.
③ Input token counts are validated before expensive requests.
④ Usage data is logged per request.
⑤ A monthly cost alert is configured in your API provider's dashboard.

Stateless API: Each API call is independent; the model has no memory of previous calls. Conversation context must be explicitly re-sent with every request.

Sliding Window: A memory management strategy that retains only the N most recent messages, discarding older ones to control token count and cost.

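Exponential Backoff: Waiting progressively longer between retry attempts (1s, 2s, 4s, 8s...). Eases pressure on an overloaded API and respects rate limit recovery windows; adding random jitter keeps many clients from retrying in lockstep (the thundering herd problem).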

tiktoken: OpenAI's official Python tokenizer. Counts tokens before sending requests, enabling cost estimation and enforcement of input size limits.

Lesson 4 Check

Conversations, Errors & Cost · 4 questions

An OpenAI API call is stateless. What does this mean for multi-turn conversation applications?

Correct. The API has no memory between calls. Every call is a fresh request. If you want the model to "remember" the conversation, your application must store the message history and re-send it with every new call. This is both the source of multi-turn capability and the source of unbounded cost growth if unmanaged.

Stateless means zero persistence between calls. The API doesn't remember anything from previous requests. Your application code is entirely responsible for storing conversation history and re-sending it — which also means you're responsible for managing how much history you re-send.

You catch an openai.RateLimitError. What is the correct response?

Correct. Rate limit errors are transient — the API is telling you to slow down, not to stop permanently. Exponential backoff (waiting progressively longer between attempts) is the standard pattern. Retrying instantly in a loop makes the problem worse by sending more requests during the rate limit window.

Rate limits are transient. Immediate retries just hammer the rate limit further. Exponential backoff — waiting 1s, then 2s, then 4s — gives the rate limit window time to reset while not giving up too quickly. Re-raising immediately or switching keys doesn't address the underlying throttling.

What does the tiktoken library let you do, and why does it matter for production applications?

Correct. tiktoken is OpenAI's tokenizer: it lets you count how many tokens a message array will consume, to within a small per-message overhead, before you make the API call. This lets you enforce input size limits, estimate costs ahead of time, and prevent the runaway cost bug that comes from unbounded conversation histories.

tiktoken counts tokens before the API call happens. This is valuable because it lets you catch oversized requests before they're sent (and billed), estimate the cost of a request, and enforce conversation history limits programmatically. It doesn't compress or convert — just count.

The response.usage object returned by the API contains prompt_tokens and completion_tokens. What is the most important operational use of this data?

Correct. Token usage logged per request lets you track cumulative costs, set budget alerts, identify unusually expensive requests (which often indicate bugs), and understand which features or user behaviors drive the most API cost. Without this data, cost overruns like the $14,000 weekend bill are nearly impossible to prevent.

The primary operational value of usage data is cost observability. By logging prompt_tokens and completion_tokens per request, you can track total spend, set alerts for unusual spikes, and identify which workflows are expensive — all of which are prerequisites for responsible production deployment.

Lab 4 · Build a Production-Ready Chatbot

Hands-on: conversation management, error handling, token counting, and cost logging

What You're Doing

This is the capstone lab for Module 1. You'll build a complete, production-ready chatbot in Python that incorporates everything from the module: proper environment setup, a ConversationManager class with sliding window history, exponential backoff retry logic, tiktoken-based token counting before each call, and per-request cost logging. The lab assistant will guide you through each component and help you debug as you go.

Start by telling the assistant what your chatbot should do (customer support, code assistant, writing helper — your choice). Then ask it to help you build the ConversationManager class first. We'll add retry logic and token counting in subsequent steps.

Lab Assistant

Lesson 4 · Production Chatbot

Welcome to Lab 4 — the module capstone. By the end of this session you'll have a complete, production-ready Python chatbot: ConversationManager with sliding window history, exponential backoff retry logic, tiktoken token counting, and cost logging on every request.

First question: what should your chatbot do? Pick a domain — customer support agent, Python code reviewer, writing editor, cooking assistant, anything. Once we have a purpose, we'll write the system prompt and start building the ConversationManager class. What's your chatbot going to be?

Module 1 Test

Python & AI Tooling · 15 questions · Pass at 80%

1. What command creates a Python virtual environment named .venv?

Correct. python3 -m venv .venv uses Python's built-in venv module to create an isolated environment in the .venv directory.

The correct command is python3 -m venv .venv — using Python's built-in venv module.

2. Which file should be listed in .gitignore to prevent API key exposure?

Correct. The .env file contains secrets and must never be committed. It belongs in .gitignore from the start of every project.

The .env file stores API keys and must be excluded via .gitignore. config.py should only load from environment variables, not store keys directly.

3. What does python-dotenv's load_dotenv() function do?

Correct. load_dotenv() parses the .env file and populates the process environment, making keys available through os.getenv() without hardcoding them in source code.

load_dotenv() reads .env and sets environment variables. It doesn't encrypt, validate, or create files — just loads key=value pairs into the process environment.

4. In the OpenAI Chat Completions API, which role is used to provide persistent instructions and persona?

Correct. The "system" role message sets the model's persona and behavioral constraints. It's the developer's primary control surface for application behavior.

The "system" role is the correct answer. The three valid roles are: system (instructions), user (human input), and assistant (model output).

5. What temperature setting produces the most deterministic, reproducible model output?

Correct. Temperature 0.0 makes the model choose the highest-probability token at each step, which makes output as repeatable as the API allows. Small run-to-run differences can still occur, so treat it as near-deterministic rather than guaranteed.

Temperature 0.0 is the most deterministic setting. Higher values introduce increasing randomness. For reproducible classification or extraction, use 0.0.

6. How does Anthropic's Claude API differ structurally from OpenAI's when passing a system prompt?

Correct. This structural difference means code written for one API cannot be directly swapped to the other without modifying how system instructions are passed.

Anthropic separates system from messages as a top-level parameter. OpenAI embeds it as the first message with role="system". Different structure, same purpose.

7. What is the primary benefit of streaming (stream=True) in user-facing applications?

Correct. Streaming is purely about perceived latency — users see the first token in under a second instead of staring at a blank screen for the full generation time.

Streaming doesn't change cost, accuracy, or rate limits. It changes perceived latency — the user sees output begin immediately, which dramatically improves the experience of waiting.

8. Which of the four prompt components is most important for ensuring parseable programmatic output?

Correct. For applications that parse model output programmatically, the output format specification is the most critical component. Specifying "return only the category name" or a precise JSON schema eliminates the ambiguity that causes parsing failures.

Output format specification is most important for programmatic use. Without it, even a well-instructed model may return its answer embedded in explanatory prose that breaks your parser.

9. The response_format={"type": "json_object"} parameter guarantees what?

Correct. response_format guarantees syntactically valid JSON, provided the response is not cut off by the max_tokens limit. It does not enforce a specific schema; that is your prompt's job. You need both: the parameter for valid JSON, and precise schema instructions for the right structure.

response_format ensures valid JSON syntax, not a specific schema. The parameter prevents JSON parse errors but doesn't control which keys appear — that requires explicit schema specification in your prompt.

10. Few-shot prompting includes examples primarily to:

Correct. Examples work by narrowing the space of plausible outputs. They show rather than tell, constraining the model's probability distribution more precisely than verbal descriptions alone can achieve.

Few-shot examples constrain outputs by demonstration — showing the exact pattern rather than describing it. They don't teach new facts or trigger fine-tuning; they shift the probability distribution toward outputs that match the demonstrated pattern.

11. An AI API is stateless. What is the direct consequence for conversation management in your application?

Correct. Stateless means zero server-side memory. Your application owns the conversation history. You store it, you manage it, you re-send it. This gives you full control — and full responsibility for managing its growth.

No server-side session memory exists. Your code must maintain and re-send the full conversation history. This is both the source of conversational capability and the source of unbounded cost growth if history isn't managed.

12. What is exponential backoff, and when should it be applied?