Module 4 · Lesson 1

What Technical Due Diligence Actually Examines

The anatomy of a TDD process — and what sophisticated investors probe before writing checks.

When a firm like a16z sends in a technical team, what are they actually looking for that a pitch deck cannot show?

In October 2022, Stability AI closed a $101 million seed round led by Coatue and Lightspeed. The investment came weeks after the public release of Stable Diffusion, which meant investors could probe a deployed system rather than a demo. Technical reviewers could directly interrogate inference latency, model card documentation, licensing of training data, and compute cost per image generated. When the company later faced scrutiny over whether its training dataset included copyrighted material, the gap between what TDD surfaced and what it missed became a landmark lesson. Investors who focused on capability benchmarks underweighted data provenance risk, a structural blind spot that weighed on valuation in subsequent rounds.

The Three Zones of Technical Due Diligence

Technical due diligence for AI companies differs materially from TDD for conventional SaaS. A SaaS reviewer can audit code quality, architecture scalability, and test coverage. An AI reviewer must additionally interrogate the model itself, the data that created it, and the operational regime under which it runs. These three zones each carry distinct risk profiles and require different reviewer specializations.

Zone 1 is Model Capability and Validity: can the model actually do what the founders claim? Zone 2 is Data and IP Integrity: is the model legally sound and technically reproducible? Zone 3 is Operational and Economic Viability: can the company serve customers with unit economics that support the projected business?

TDD: Technical Due Diligence. The formal process by which investors assess the technical merits, risks, and defensibility of a technology company prior to investment.

Model Card: A structured document disclosing a model's intended use cases, evaluation results, training data characteristics, and known limitations. It is a primary artifact in Zone 1 review.

Data Provenance: The documented origin, chain of custody, licensing status, and processing history of training data. It is the central artifact in Zone 2 review.

Who Conducts the Review

Tier-1 venture firms typically assemble a review team that includes an internal technical partner, one or two domain-specific consultants (e.g., an ML engineer from a relevant industry), and sometimes a law firm specializing in IP. In 2023, Andreessen Horowitz formalized an AI-specific TDD protocol that includes red-team testing of model outputs against adversarial prompts, a practice that originated in defense and has migrated to commercial AI investment. Smaller firms often rely on a single external consultant, which leaves gaps wherever the review falls outside that consultant's specialty.

Founders who understand the structure of this review process can prepare documentation proactively rather than scrambling to answer questions during the diligence period. A well-prepared founder shortens the diligence cycle, signals operational maturity, and reduces the probability of renegotiated terms.

Investor Perspective

General Catalyst's Ken Chenault Jr., speaking at a 2023 MIT Investment Forum, noted that technical reviewers now spend roughly 40% of AI company diligence time on data provenance and legal exposure — a proportion that was under 10% in 2019. The shift directly followed high-profile IP litigation in generative AI.

The Six Standard TDD Artifacts

Experienced technical reviewers arrive expecting to see six categories of documentation. Founders who cannot produce these quickly signal organizational immaturity, even when the underlying technology is strong.

Artifact	What Reviewers Look For	Common Gap
Model Card / Technical Report	Benchmark methodology, evaluation datasets, failure mode documentation	Benchmarks chosen to flatter; no failure cases disclosed
Data Lineage Documentation	Source, license, processing pipeline, retention policy	Scraped data with no license audit; unclear consent
Architecture Diagram	System components, dependencies, single points of failure	Outdated diagrams that don't match production
Inference Cost Analysis	Cost per query at current and projected scale; GPU/TPU spend	Dev environment costs cited; production can run 5–10× higher
Security and Access Controls	Model weight protection, API authentication, audit logging	No access logging; model weights accessible to all engineers
Evaluation Test Suite	Reproducible evals; held-out test sets; bias testing	No held-out data; train/test contamination

Module Thesis

Technical due diligence is not a gatekeeping ritual. It is a structured exercise in reducing information asymmetry between founders and investors. Founders who treat it as adversarial lose; founders who treat it as a collaborative disclosure process win trust and frequently negotiate better terms.

The Timeline Pressure Problem

In competitive rounds, investors compress TDD timelines aggressively. During the 2021–2022 bull market, some AI seed deals closed with two-week diligence windows. When Inflection AI raised $1.3 billion in June 2023, the presence of Microsoft and NVIDIA as strategic co-investors effectively pre-validated technical claims, shortening the commercial investor review. For most founders, no such shortcut exists. A compressed timeline rewards founders who maintain living documentation — artifacts that are continuously updated rather than assembled ad hoc when a term sheet appears.
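One way to make "living documentation" concrete is to record when each of the six artifacts was last updated and flag anything missing or stale. The following is a minimal sketch under stated assumptions: the `stale_artifacts` helper, the 90-day threshold, and the sample dates are all hypothetical and are not drawn from any investor's actual checklist.

```python
from datetime import date

# The six artifact categories from the table earlier in this lesson.
ARTIFACTS = [
    "Model Card / Technical Report",
    "Data Lineage Documentation",
    "Architecture Diagram",
    "Inference Cost Analysis",
    "Security and Access Controls",
    "Evaluation Test Suite",
]


def stale_artifacts(last_updated: dict[str, date], today: date,
                    max_age_days: int = 90) -> list[str]:
    """Return artifacts that are missing or older than max_age_days.

    The 90-day default is an illustrative assumption, not a standard.
    """
    flagged = []
    for name in ARTIFACTS:
        updated = last_updated.get(name)
        if updated is None or (today - updated).days > max_age_days:
            flagged.append(name)
    return flagged


# Hypothetical example: only two artifacts exist, and one is out of date.
last_updated = {
    "Model Card / Technical Report": date(2024, 1, 10),
    "Architecture Diagram": date(2023, 6, 1),
}
for name in stale_artifacts(last_updated, today=date(2024, 3, 1)):
    print("needs attention:", name)
```

Running a check like this on a schedule, rather than when a term sheet arrives, is what separates continuously maintained artifacts from ad hoc ones.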

Lesson 1 Quiz

What Technical Due Diligence Actually Examines

1. Which of the following best describes the primary difference between AI company TDD and conventional SaaS TDD?

Correct. The three-zone model — capability, data/IP, and operational viability — extends well beyond the code-and-architecture scope of conventional SaaS diligence.

Not quite. Review the three-zone framework: model capability, data and IP integrity, and operational/economic viability.

2. The Stability AI case illustrates that early technical diligence teams underweighted which specific risk?

Correct. Reviewers focused on capability benchmarks and underweighted whether training data was legally licensed — a gap that later caused significant legal and valuation problems.

Incorrect. The lesson explicitly highlights data provenance risk as the structural blind spot in the Stability AI diligence.

3. What does "living documentation" mean in the context of TDD preparation?

Correct. Living documentation means maintaining artifacts in a current state at all times, which allows a founder to respond to compressed diligence timelines without scrambling.

Not quite. Living documentation refers to the practice of maintaining TDD artifacts continuously rather than assembling them reactively when investment pressure arrives.

Lab 1 — TDD Artifact Audit

Practice with an AI advisor specializing in technical due diligence preparation

Your Task

You are preparing for an investor technical review of your AI startup. Use the AI advisor below to audit your current artifact readiness. Describe your company's stage and technology, then ask which of the six TDD artifact categories you should prioritize first and why.

Start by describing your AI startup in 2–3 sentences (real or hypothetical), then ask: "Which of the six TDD artifact categories is my biggest gap and what should I do first?"

TDD Readiness Advisor

Welcome. I'm your Technical Due Diligence preparation advisor. Tell me about your AI company — what it does, your current stage, and what technical artifacts you have on hand — and I'll help you identify your most critical gaps before an investor review.

Module 4 · Lesson 2

Model Capability Validation and Benchmarking

How to document what your model can actually do — and survive adversarial testing by technical reviewers.

When a technical reviewer red-teams your model, what are they trying to find — and how do you prepare for it without hiding weaknesses?

When OpenAI published the GPT-4 technical report in March 2023, the document was deliberately partial — it disclosed benchmark scores across dozens of academic and professional exams but withheld architecture details, training data specifics, and compute requirements for competitive reasons. This created an instructive tension: the report was simultaneously the most detailed public AI model disclosure to that point and a strategic communication artifact. Investors and enterprise customers receiving private briefings got more depth. The lesson for founders is that what you disclose and how you disclose it are separate decisions — but technical reviewers expect to receive more than the public version.

The Benchmark Credibility Problem

Benchmarks are the primary currency of model capability communication. They are also routinely gamed. In 2023, researchers at Stanford's Center for Research on Foundation Models documented systematic benchmark overfitting — models achieving high scores on standard evaluations while underperforming dramatically on structurally similar but novel tasks. Technical reviewers at sophisticated funds now actively probe for this.

The core question a reviewer asks is: did you choose benchmarks because they reflect your actual use case, or because your model performs well on them? Founders who can answer this question honestly — and demonstrate that they have used domain-specific evaluation datasets drawn from real customer workflows — build significantly more credibility than founders who cite MMLU scores as evidence of enterprise readiness.

MMLU: Massive Multitask Language Understanding, a widely cited academic benchmark that tests knowledge across 57 subjects. Often cited by founders; often considered insufficient by technical reviewers for domain-specific applications.

Train/Test Contamination: The condition where evaluation data was inadvertently included in training data, causing artificially inflated benchmark scores. It is one of the first things a technical reviewer checks.

Held-Out Test Set: An evaluation dataset never exposed to the model during training or development, providing the cleanest estimate of true generalization performance.
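A reviewer's first contamination check is often a simple overlap search between training text and evaluation examples. The sketch below is one hedged way to do that with word n-grams; the function names, the 8-word window, and the sample strings are assumptions chosen for illustration, and real audits typically add fuzzy matching and run at much larger scale.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercase the text, keep alphanumeric tokens, return all word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_examples(train_docs: list[str], test_examples: list[str],
                          n: int = 8) -> list[int]:
    """Return indices of test examples sharing any n-gram with the training docs."""
    train_index: set[tuple[str, ...]] = set()
    for doc in train_docs:
        train_index |= ngrams(doc, n)
    return [i for i, example in enumerate(test_examples)
            if ngrams(example, n) & train_index]


# Hypothetical data: the first test item quotes a training document verbatim.
train_docs = ["The quick brown fox jumps over the lazy dog near the river bank today"]
test_examples = [
    "Question: the quick brown fox jumps over the lazy dog near what?",
    "An unrelated sentence about inference latency and GPU batching costs overall",
]
print(contaminated_examples(train_docs, test_examples))  # [0]
```

Any index returned here is a candidate for removal from the held-out set, or at minimum for disclosure in the model card.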

What a Red-Team Review Looks Like

A formal red-team review, as practiced by technical diligence teams at firms including Bessemer Venture Partners and General Catalyst, involves structured adversarial prompting across multiple dimensions. Reviewers test for failure modes the model card should have disclosed. When failures are discovered that the card did not mention, the credibility damage extends beyond the technical finding — it signals that the founder either didn't test thoroughly or chose not to disclose problems.

The counterintuitive preparation strategy is to conduct your own red-team before the investor does and to include those findings in your model card proactively. A model card that documents five known failure modes is more credible than one that documents zero. Reviewers understand that all models fail — what they are assessing is whether the founder understands how and where.

Real Practice — Hugging Face Model Cards

Hugging Face's model card template, adopted by over 200,000 model submissions as of 2024, includes mandatory sections for "Out-of-Scope Use," "Bias, Risks, and Limitations," and "Environmental Impact." Technical reviewers familiar with the format expect to see this structure. Founders who use it signal awareness of community standards; founders who present a two-paragraph capability summary signal inexperience.

Building a Domain-Specific Evaluation Suite

The most defensible capability documentation is a custom evaluation suite built from real customer data. This requires: (1) collecting a sample of representative real-world inputs from pilot customers or domain experts, (2) establishing ground-truth outputs verified by domain experts, (3) running the model against this dataset with results that are reproducible by a reviewer, and (4) documenting the methodology in sufficient detail that an external engineer could replicate the evaluation.
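Steps (3) and (4) above can be sketched as a small harness that fingerprints the evaluation dataset and records every failure, so a reviewer can confirm they ran the same data and see where the model broke. This is a minimal illustration, not a standard tool: `run_eval`, the exact-match scoring rule, and the toy model are all assumptions, and many real tasks need a more forgiving metric.

```python
import hashlib
import json
from typing import Callable


def run_eval(model_fn: Callable[[str], str], dataset: list[dict]) -> dict:
    """Score model_fn on rows shaped like {"input": ..., "expected": ...}.

    Uses exact string match after stripping whitespace (an assumption).
    """
    # A hash of the canonical dataset lets a reviewer verify they hold the same data.
    canonical = json.dumps(dataset, sort_keys=True).encode("utf-8")
    correct = 0
    failures = []
    for row in dataset:
        output = model_fn(row["input"])
        if output.strip() == row["expected"].strip():
            correct += 1
        else:
            failures.append({"input": row["input"],
                             "expected": row["expected"],
                             "got": output})
    return {
        "dataset_sha256": hashlib.sha256(canonical).hexdigest(),
        "n_examples": len(dataset),
        "accuracy": correct / len(dataset) if dataset else 0.0,
        "failures": failures,
    }


# Hypothetical stand-in for a real model call.
def toy_model(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "unknown")


dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
report = run_eval(toy_model, dataset)
print(report["accuracy"], len(report["failures"]))
```

Keeping the failure list in the report, rather than only the headline score, supports the proactive failure-mode disclosure discussed earlier in this lesson.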

When Cohere prepared for its Series C in 2022, the company provided enterprise customers with private evaluation environments where customers could test Cohere models against their own data before committing to contracts. This same approach — letting the model speak on customer-representative tasks — is the gold standard for investor TDD.

Benchmark Type	Credibility with Reviewers	When Appropriate
Standard academic (MMLU, HellaSwag)	Low — easily gamed, not domain-relevant	Baseline comparison only
Industry leaderboard (LMSYS Chatbot Arena)	Medium — reflects real preferences but limited domain depth	General-purpose applications
Domain-specific curated set	High — relevant, harder to game	Vertical AI applications
Customer-validated blind test	Very high — real-world signal with third-party validation	Enterprise sales motion

Preparation Principle

Every claim in your pitch deck about model performance should map to a specific, reproducible evaluation. Technical reviewers are trained to ask "show me the eval" for every capability assertion. Founders who can produce it immediately, with methodology documentation, compress diligence timelines and build trust simultaneously.

Lesson 2 Quiz

Model Capability Validation and Benchmarking

1. Stanford CRFM research documented that high benchmark scores can coexist with poor performance on novel tasks. This is called:

Correct. Benchmark overfitting occurs when a model achieves strong performance on specific evaluation sets without genuinely generalizing to structurally similar novel tasks — a pattern that technical reviewers actively probe for.

Not quite. The lesson describes this as benchmark overfitting — achieving high scores on evaluations because the model (or its training process) has been optimized toward those specific tests.

2. A model card that documents five known failure modes is considered MORE credible than one documenting zero because:

Correct. Reviewers are not assessing whether a model is perfect — they are assessing whether founders understand how and where it fails. Proactive disclosure of failure modes signals rigor and honesty.

Incorrect. The credibility comes from demonstrating that the founder has conducted thorough testing and is willing to disclose findings honestly — the opposite of hiding problems.

3. Which evaluation type carries the highest credibility with technical reviewers during investment diligence?

Correct. Customer-validated blind testing on real-world data provides the strongest signal because it combines domain relevance with third-party validation — it is the hardest type to game.

Not quite. Review the benchmark credibility table. Customer-validated blind tests rank highest because they combine domain relevance with independent verification.

Lab 2 — Evaluation Suite Design

Build a credible model evaluation strategy with an AI benchmarking advisor

Your Task

Design a model evaluation strategy that will hold up under investor scrutiny. Describe your AI model's primary capability claim to the advisor, then work through what a credible, domain-specific evaluation suite would look like for your use case.

Start by stating your model's primary capability claim (e.g., "Our model extracts structured data from medical records with 94% accuracy"). Then ask: "What would a credible evaluation suite look like to support this claim during investor TDD?"

Evaluation Strategy Advisor

I'm your model evaluation advisor for investor due diligence preparation. State your model's primary capability claim, and I'll help you design an evaluation suite that is credible, reproducible, and resistant to the benchmark-gaming objections that technical reviewers raise.

Module 4 · Lesson 3

Data Provenance, IP Risk, and Legal Architecture

The legal infrastructure beneath your model — and how investors assess copyright, licensing, and data governance exposure.

If an investor's legal team asked for a complete audit trail of every dataset used to train your model, what would they find — and what should you have ready?

In February 2023, Getty Images filed suit against Stability AI in the US District Court for the District of Delaware, alleging that the company had scraped and used over 12 million Getty images to train Stable Diffusion without a license. The complaint noted that Stability AI's outputs sometimes reproduced Getty's watermark, a technically significant finding because it suggested that the model had memorized watermark patterns from its training data. For investors who had funded Stability AI, this case illustrated a core diligence failure: no systematic data provenance audit had been conducted before the close. The suit introduced a material litigation contingency onto the company's balance sheet and contributed to the leadership instability that followed in 2023.

The Four Data Provenance Questions

Technical reviewers and investment counsel now approach training data documentation through four sequential questions. A founder who cannot answer all four creates a contingent liability that will either reduce valuation or kill the deal.

Source: Where did each dataset originate? (Common crawl, licensed APIs, internal collection, third-party purchases, open-source repositories)
License: Under what terms is the data available for use? (CC0, CC-BY, research-only, commercial license, terms-of-service-governed)
Processing: How was raw data transformed? (Filtering, deduplication, annotation, PII scrubbing — each step should be documented)
Retention and Deletion: What is the data governance policy? Can specific data be identified and removed if a rights-holder issues a valid request?

Data Lineage: The complete documented history of a dataset from its original source through all transformations to its use in model training. It is the primary artifact reviewed in IP due diligence.

Machine Unlearning: Technical methods for removing the influence of specific training examples from a trained model. Increasingly relevant as regulators and courts explore data deletion rights in the AI context.

ToS Violation Risk: The exposure created when training data was scraped from platforms whose terms of service prohibit automated data collection. A common gap in pre-2022 datasets.

Open-Source Data Licensing Complexity

Many founders believe that using open-source or publicly available datasets eliminates IP risk. This is incorrect. Creative Commons licenses vary significantly: CC-BY requires attribution; CC-BY-NC prohibits commercial use; CC-BY-SA requires derivative works to carry the same license. A model trained on CC-BY-NC data and sold commercially is in violation. The Common Crawl dataset, used to train many major LLMs including GPT-3 and LLaMA, contains scraped content from websites with widely varying terms of service. Whether training on this data qualifies as fair use is actively litigated as of 2024.

In September 2023, the Authors Guild v. OpenAI class action complaint alleged that OpenAI had used copyrighted books from LibGen, a shadow library, to train GPT models. OpenAI's defense relies substantially on fair use doctrine. For a startup without OpenAI's legal resources, equivalent exposure carries existential risk rather than manageable litigation cost.

Best-Practice Standard — Responsible AI Licenses

BigScience's BLOOM model (2022) was released under a RAIL license (Responsible AI License), which attaches use restrictions at the model level rather than the data level. Investors reviewing AI companies now commonly ask: "What license governs your model weights, and what does it permit downstream?" This is distinct from training data licensing and is an additional artifact category that post-2022 companies need to address.

Building a Data Governance Framework Investors Can Review

A reviewable data governance framework includes five components: (1) a data registry — a structured inventory of every dataset used, with source, license, version, and date of acquisition; (2) a license compatibility matrix — a documented analysis of whether the combination of licenses across datasets is compatible with commercial deployment; (3) a PII processing log — documentation of any personal information in training data and what was done to anonymize or exclude it; (4) a data retention and deletion policy — clear governance over how long raw training data is stored and who can access it; and (5) a legal opinion memo — ideally from outside counsel — assessing the company's IP exposure across its training data stack.
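Components (1) and (2) lend themselves to a small structured record plus a lookup table. The sketch below is illustrative only and is not legal advice: the `DatasetRecord` fields follow the registry description above, but the `COMMERCIAL_USE` statuses, the helper name, and the sample entries are assumptions, and any license not in the table is routed to counsel rather than guessed at.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative commercial-use statuses based on the license notes in this lesson.
# A real compatibility matrix should come from counsel, not from this table.
COMMERCIAL_USE = {
    "CC0": "allowed",
    "CC-BY": "allowed_with_attribution",
    "CC-BY-SA": "review_share_alike",
    "CC-BY-NC": "prohibited",
    "research-only": "prohibited",
}


@dataclass(frozen=True)
class DatasetRecord:
    """One row of the data registry: source, license, version, acquisition date."""
    name: str
    source: str
    license_id: str
    version: str
    acquired: date


def license_flags(registry: list[DatasetRecord]) -> dict[str, str]:
    """Map dataset name to status for every entry that needs attention."""
    flags = {}
    for record in registry:
        status = COMMERCIAL_USE.get(record.license_id, "unknown_needs_counsel")
        if status not in ("allowed", "allowed_with_attribution"):
            flags[record.name] = status
    return flags


# Hypothetical registry entries.
registry = [
    DatasetRecord("web-crawl-sample", "Common Crawl", "terms-of-service-governed",
                  "2023-06", date(2023, 7, 1)),
    DatasetRecord("pilot-notes", "internal collection", "CC0", "v2", date(2023, 9, 15)),
    DatasetRecord("forum-dump", "third-party purchase", "CC-BY-NC", "v1", date(2022, 11, 3)),
]
for name, status in license_flags(registry).items():
    print(name, "->", status)
```

A registry kept in this form can be exported on request, which is what makes a fast response to a first-day document request feasible.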

Investor Signal

Felicis Ventures partner Aydin Senkut noted in a 2023 interview that data provenance documentation is now a standard first-day document request in AI investment diligence. Companies that cannot produce a data registry within 48 hours of a term sheet trigger a diligence flag. Companies that produce it proactively — before being asked — signal sophisticated legal and technical leadership.

Lesson 3 Quiz

Data Provenance, IP Risk, and Legal Architecture

1. The Getty Images v. Stability AI case is cited in this lesson primarily because it illustrates:

Correct. The lesson uses this case to illustrate how a pre-investment failure to audit training data provenance resulted in substantial post-investment legal exposure — the kind of contingent liability that TDD should have surfaced.

Not quite. The case is referenced to show how missing data provenance documentation allowed IP litigation exposure to accumulate undetected until after investment.

2. A model trained on data licensed under CC-BY-NC and then sold as a commercial product is:

Correct. CC-BY-NC (Creative Commons Attribution-NonCommercial) explicitly prohibits commercial use. Training on this data and selling the resulting model commercially is a license violation that creates IP exposure.

Incorrect. NC stands for Non-Commercial. Training on NC-licensed data and selling the resulting commercial product violates the license regardless of revenue level or attribution provided.

3. Which of the following is NOT one of the five components of a reviewable data governance framework as described in this lesson?

Correct. Competitor benchmarking is not part of the data governance framework. The five components are: data registry, license compatibility matrix, PII processing log, data retention and deletion policy, and legal opinion memo.

That item IS one of the five components. Review the data governance framework section. The five components are: data registry, license compatibility matrix, PII processing log, data retention and deletion policy, and legal opinion memo.

Lab 3 — Data Provenance Audit

Work through your training data licensing exposure with an AI legal risk advisor

Your Task

Conduct a preliminary data provenance audit with the AI advisor. Describe your training data sources — where you got the data, what licenses apply, and how it was processed. The advisor will help you identify licensing exposure and prioritize remediation.

Start by listing your model's training data sources (e.g., "We used Common Crawl, a licensed dataset from DataProvider Inc., and customer-provided documents"). Then ask: "What are my biggest IP exposure points and what documentation should I create first?"

Data Provenance Risk Advisor

I'm your data provenance and IP risk advisor. Describe your training data sources — the datasets you used, where they came from, and what licenses you believe apply — and I'll help you identify licensing gaps, ToS exposure, and the documentation you need to survive investor legal review.

Module 4 · Lesson 4

Operational Architecture and AI Unit Economics

The infrastructure questions investors ask — and how compute cost, latency, and scalability determine whether your business model actually works.

What does it actually cost to serve your model at scale, and how do you defend that number when an investor's technical advisor runs their own calculation?

Character.AI, which raised $150 million at a $1 billion valuation from a16z in March 2023, faced intense scrutiny of its inference cost structure. The company served multi-turn conversational AI at consumer scale — at peak, running hundreds of millions of messages per day across personas. Technical reviewers from investment teams needed to model whether the economics of serving this volume at consumer price points (effectively free, ad-supported) could produce a viable business. The analysis required understanding not just raw compute cost per token, but session length distributions, model size tradeoffs, caching strategies, and batching efficiency. Character.AI had engineered substantial inference optimizations — including custom model distillation — that reduced per-token costs significantly. Founders who had not done equivalent analysis would not have been able to defend their unit economics in a comparable review.

The Infrastructure Review Checklist

Technical reviewers assess operational architecture across four dimensions: compute cost, latency profile, reliability and redundancy, and scalability ceiling. Each dimension requires specific documentation. Founders who present architecture diagrams without cost annotations leave the most important question unanswered.

Dimension	Key Metrics	What Reviewers Calculate
Compute Cost	Cost per inference, cost per token, GPU/TPU spend as % of revenue	Gross margin at scale; comparison to stated unit economics
Latency Profile	P50, P95, P99 response times; time-to-first-token	Whether latency meets stated SLA at customer scale
Reliability	Uptime history, failure modes, rollback capability	Whether redundancy architecture matches enterprise commitments
Scalability	Current peak capacity, bottlenecks, cost to double throughput	Whether growth projections are achievable at stated cost structure
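The latency row in the table above depends on percentile figures rather than averages, because a small share of slow requests can break an SLA while leaving the mean looking healthy. The following sketch computes P50, P95, and P99 with the nearest-rank method; the function names and the sample latencies are hypothetical, and production systems usually take these numbers from their monitoring stack instead.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_profile(samples_ms: list[float]) -> dict[str, float]:
    """Summarize response times at the percentiles reviewers typically ask for."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# Hypothetical load-test result: most requests are fast, a tail is slow.
samples_ms = [120] * 90 + [900] * 9 + [2500]
profile = latency_profile(samples_ms)
mean_ms = sum(samples_ms) / len(samples_ms)
print(profile, "mean:", mean_ms)
```

In this made-up sample the median is 120 ms while P95 is 900 ms, which is the kind of gap a reviewer compares against the stated SLA.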

Cost Per Inference: The total compute cost of a single model invocation, including GPU time, memory bandwidth, networking, and infrastructure overhead. The fundamental unit of AI operational economics.

Model Distillation: A technique where a smaller "student" model is trained to replicate the outputs of a larger "teacher" model. It is used to reduce inference costs while preserving significant capability.

Inference Optimization: Engineering techniques including quantization, batching, caching, and hardware selection that reduce the cost and latency of serving model predictions at scale.

The GPU Burn Rate Trap

One of the most common mismatches discovered during AI company TDD is the gap between development environment compute costs and production inference costs. In a development environment, engineers use powerful GPUs for long training runs, but the GPU is idle much of the time. In production, serving latency requirements mean GPUs must be reserved even during low-traffic periods, creating a fixed cost floor. Many early-stage AI founders calculate their unit economics from development-environment spot instance pricing — which can be 5–10× cheaper than reserved production GPU capacity at the reliability levels enterprise customers require.
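The gap described above comes from two factors that multiply: a higher hourly price for dependable capacity, and GPUs that sit partly idle because they must stay reserved through low-traffic periods. A rough sketch of the arithmetic follows; every number in it (hourly prices, throughput, utilization) is a hypothetical placeholder, and `cost_per_inference` is an illustrative helper rather than an established formula from any provider.

```python
def cost_per_inference(gpu_hourly_usd: float,
                       peak_inferences_per_hour: float,
                       utilization: float) -> float:
    """Effective cost of one inference on a GPU that is busy `utilization` of the time.

    utilization is the fraction of reserved capacity actually serving traffic (0 < u <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_hourly_usd / (peak_inferences_per_hour * utilization)


# Hypothetical development-environment estimate: cheap spot GPU, assumed fully busy.
dev_estimate = cost_per_inference(gpu_hourly_usd=0.50,
                                  peak_inferences_per_hour=10_000,
                                  utilization=1.0)

# Hypothetical production reality: pricier dependable capacity, busy 25% of the time.
prod_estimate = cost_per_inference(gpu_hourly_usd=3.00,
                                   peak_inferences_per_hour=10_000,
                                   utilization=0.25)

print(f"dev:  ${dev_estimate:.5f} per inference")
print(f"prod: ${prod_estimate:.5f} per inference")
print(f"ratio: {prod_estimate / dev_estimate:.0f}x")
```

With these placeholder inputs, a 6× price difference combined with 25% utilization yields a 24× gap in cost per inference, which is why reviewers ask for utilization data alongside GPU pricing.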

During its Series B diligence in 2022, Cohere had to demonstrate not just current inference costs but a credible roadmap to improving gross margins as scale increased. The company's ability to show engineering initiatives — including custom CUDA kernels and inference batching improvements — that would systematically reduce cost per token over 18 months was a key element of the investment thesis. Founders who present static cost structures without improvement roadmaps signal engineering ceiling, not just current cost.

Technical Review — What a16z's AI Infrastructure Team Tests

According to Andreessen Horowitz's published AI investment framework (a16z.com, 2023), infrastructure reviews for AI companies include latency testing under synthetic load, cost modeling at 10× current volume, and an assessment of whether the company's current model size and architecture can be optimized or whether cost reduction requires fundamental re-architecture. Companies that require re-architecture to achieve target economics receive significantly higher risk assessments.

Building an Operational Readiness Package

A complete operational readiness package for TDD contains: a system architecture diagram with current production configuration annotated with monthly cost; a load testing report showing performance at 1×, 5×, and 10× current traffic; a cost modeling spreadsheet with current and projected cost per inference at revenue milestones; an uptime log from the past 90 days of production operation; and an optimization roadmap identifying the engineering initiatives that will reduce cost per inference over the next four quarters with rough sizing of impact.
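The cost modeling spreadsheet in this package can be prototyped in a few lines before it becomes a formal document. The sketch below projects GPU count, cost per inference, and compute-only gross margin at 1×, 5×, and 10× traffic. It is a simplification under loud assumptions: all inputs are hypothetical, compute is treated as the only cost of revenue, and GPUs are provisioned in whole units with fixed per-GPU throughput.

```python
import math


def project_unit_economics(monthly_requests: int,
                           price_per_request: float,
                           gpu_monthly_usd: float,
                           requests_per_gpu_month: int,
                           multipliers: tuple[int, ...] = (1, 5, 10)) -> list[dict]:
    """Project compute cost and gross margin at several traffic multiples."""
    rows = []
    for m in multipliers:
        requests = monthly_requests * m
        # GPUs come in whole units, so capacity grows in steps.
        gpus = math.ceil(requests / requests_per_gpu_month)
        compute_cost = gpus * gpu_monthly_usd
        revenue = requests * price_per_request
        rows.append({
            "multiplier": m,
            "gpus": gpus,
            "cost_per_inference": compute_cost / requests,
            "gross_margin": (revenue - compute_cost) / revenue,
        })
    return rows


# Hypothetical inputs for illustration only.
rows = project_unit_economics(monthly_requests=1_000_000,
                              price_per_request=0.01,
                              gpu_monthly_usd=2_000.0,
                              requests_per_gpu_month=1_500_000)
for row in rows:
    print(f'{row["multiplier"]:>2}x  gpus={row["gpus"]}  '
          f'cost/inf=${row["cost_per_inference"]:.4f}  '
          f'margin={row["gross_margin"]:.0%}')
```

Replacing the fixed `requests_per_gpu_month` with figures from the load testing report, and adding planned optimizations as separate scenarios, turns this into the start of the optimization roadmap described above.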

This package addresses the most common investor objection about AI companies: that the business model depends on compute costs falling in ways outside the company's control. Companies that demonstrate active, owned optimization programs — rather than passive reliance on GPU price trends — command significantly stronger negotiating positions.

Module Completion Principle

Technical due diligence is ultimately an exercise in organized, honest disclosure. The founders who succeed are not those with perfect models or zero IP risk — they are those who have documented their system thoroughly, understand their own failure modes, and can answer every question a technical reviewer asks before the question is asked. Preparation is the competitive advantage.

Lesson 4 Quiz

Operational Architecture and AI Unit Economics

1. The "GPU burn rate trap" described in this lesson refers to founders who:

Correct. Development environment costs — especially spot instances — are dramatically cheaper than the reserved production capacity required to meet enterprise SLAs, creating a systematic underestimation of production unit economics.

Not quite. The trap is specifically about the pricing gap between development spot instances (cheap, interruptible) and the reserved production GPU capacity needed for enterprise-grade reliability.

2. An optimization roadmap is valued in TDD because it demonstrates:

Correct. Investors distinguish between companies that rely on GPU price trends (external, uncontrolled) and companies that have active engineering programs to reduce inference costs through optimization work they control.

Incorrect. The value of an optimization roadmap is demonstrating that cost reduction is an active, owned engineering program — not a passive bet on market conditions.

3. Which latency metric is most relevant for a user-facing AI application where the first response character appears before the full output is complete?

Correct. Time-to-first-token (TTFT) is the critical latency metric for streaming AI applications where user-perceived responsiveness is dominated by how quickly the first output token appears, not total generation time.

Not quite. For streaming applications, time-to-first-token (TTFT) is the most user-relevant latency metric because it determines how quickly the user perceives the system as responsive.

Lab 4 — Unit Economics Stress Test

Model your AI inference economics under investor scrutiny with an AI advisor

Your Task

Prepare your operational unit economics for investor review. Describe your current inference setup and pricing model to the advisor, then stress-test your assumptions against the objections a technical reviewer would raise.

Start by describing your inference setup: what model you serve, on what infrastructure, at what approximate cost, and what you charge customers. Then ask: "What assumptions in my unit economics would a technical reviewer challenge, and how should I prepare?"

AI Unit Economics Advisor

I'm your AI unit economics advisor for investor due diligence preparation. Tell me about your inference setup — what model you serve, the infrastructure stack, current costs, and your pricing model — and I'll help you stress-test your assumptions against the objections a technical reviewer will raise.

Module 4 — Technical Due Diligence Preparation

Module Test · 15 Questions · Pass at 80%

1. Technical due diligence for AI companies differs from conventional SaaS TDD primarily because AI TDD must assess:

Correct. The three-zone extension — model capability, data/IP, and operational viability — distinguishes AI TDD from conventional software diligence.

Incorrect. AI TDD extends beyond conventional code review into model validity, data provenance, and inference economics.

2. Which artifact category was most severely underweighted in the Stability AI pre-investment diligence, according to the lesson?

Correct. Reviewers focused on capability benchmarks and missed the data provenance risk that later produced material IP litigation.

Incorrect. Data provenance — specifically the licensing status of the training images — was the critical gap.

3. A technical reviewer who discovers a failure mode that was NOT disclosed in a model card will most likely conclude:

Correct. Undisclosed failure modes damage credibility not just technically but organizationally — they suggest either insufficient testing or intentional omission.

Incorrect. Undisclosed failures signal to reviewers that the founder lacks rigor or transparency — both are serious trust problems in an investment context.

4. The Hugging Face model card template is relevant to investor TDD because:

Correct. The Hugging Face format is a de facto community standard; reviewers who see a company following it read that as a signal of technical maturity and awareness.

Incorrect. The template is a community standard that signals technical competence, not a legal requirement or automation tool.

5. Train/test contamination refers to:

Correct. When evaluation data leaks into training, benchmark scores reflect memorization rather than true generalization — one of the first things technical reviewers check for.

Incorrect. Train/test contamination is a data pipeline problem where evaluation examples appear in training, producing misleadingly high benchmark scores.
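
As a rough illustration of the kind of check involved, the sketch below flags evaluation examples that also appear in the training set after light normalization. It is an assumption-laden simplification: the function names are hypothetical, and exact matching after normalization will not catch paraphrases or partial overlaps, which more thorough audits would also look for.

```python
import re
from typing import Iterable, List, Set


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def contaminated_examples(train: Iterable[str], eval_set: Iterable[str]) -> List[str]:
    """Return evaluation examples whose normalized text also occurs in training data."""
    seen: Set[str] = {normalize(t) for t in train}
    return [e for e in eval_set if normalize(e) in seen]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    train = ["What is the capital of France?", "Summarize this contract."]
    eval_set = ["what is the capital of  France", "Translate this to German."]
    leaked = contaminated_examples(train, eval_set)
    print(f"{len(leaked)} of {len(eval_set)} eval examples overlap with training data")
```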

6. According to the lesson, the Authors Guild v. OpenAI complaint alleged that OpenAI had used which data source without authorization?

Correct. The Authors Guild complaint specifically alleged unauthorized use of copyrighted books from LibGen, with OpenAI's defense relying on fair use doctrine.

Incorrect. The Authors Guild case involved LibGen, a shadow library of books — distinct from the Getty Images case, which involved images and Stability AI.

7. The BigScience BLOOM model introduced the RAIL license. What distinguishes RAIL from Creative Commons licenses?

Correct. RAIL (Responsible AI License) governs model use — what downstream users can do with the model — rather than data licensing terms.

Incorrect. RAIL's key distinction is that it applies at the model level, restricting certain uses of the model itself regardless of training data licensing.

8. Which of the following is considered a component of a reviewable data governance framework?

Correct. The PII processing log is one of the five framework components, alongside the data registry, license compatibility matrix, data retention policy, and legal opinion memo.

Incorrect. Review the five-component data governance framework: data registry, license compatibility matrix, PII processing log, data retention policy, and legal opinion memo.

9. Character.AI's engineering approach to reducing inference costs, cited in the lesson, primarily involved:

Correct. Character.AI used custom model distillation to reduce inference costs, enabling it to serve high-volume consumer conversations at an economically viable cost per session.

Incorrect. The lesson specifically cites custom model distillation as Character.AI's inference cost reduction strategy.

10. The "GPU burn rate trap" most commonly results from founders using which type of pricing for unit economics calculations?

Correct. Spot instances are cheap and interruptible — appropriate for batch training but inappropriate for production serving where uptime SLAs require reserved capacity.

Incorrect. The specific trap is spot instance pricing. Spot capacity is appropriate for training, but it can be up to 10× cheaper than the reserved production capacity needed for enterprise reliability, so unit economics built on it understate real serving costs.

11. According to the lesson, what percentage of AI company diligence time do technical reviewers now spend on data provenance and legal exposure?

Correct. General Catalyst's Ken Chenault Jr. cited approximately 40% of AI diligence time on data provenance and legal exposure — up from under 10% in 2019, driven by IP litigation in generative AI.

Incorrect. The lesson cites roughly 40%, up from under 10% in 2019. That is more than a fourfold increase, driven by high-profile IP litigation in generative AI.

12. A domain-specific curated evaluation set has higher credibility than MMLU benchmarks for a vertical AI application because:

Correct. Domain relevance and resistance to gaming are the key reasons. MMLU is an academic breadth benchmark that correlates poorly with specialized vertical performance.

Incorrect. The key advantages are domain relevance (the eval actually reflects the use case) and resistance to overfitting (harder to game than a well-known public benchmark).

13. An operational readiness package for TDD should include a load testing report showing performance at:

Correct. The lesson specifies 1×, 5×, and 10× current traffic — demonstrating that the system can handle growth projections, not just current load.

Incorrect. The lesson specifies testing at 1×, 5×, and 10× current traffic to demonstrate scalability across growth scenarios.

14. Felicis Ventures noted that companies producing their data registry proactively — before being asked — signal what to investors?

Correct. Proactive disclosure of the data registry signals that the team has thought carefully about IP risk before investors raised it — a marker of organizational maturity.

Incorrect. Proactive documentation signals sophisticated leadership; it is a positive signal, not a suspicious one.

15. According to the module's core thesis, the founders who succeed in technical due diligence are primarily those who:

Correct. Preparation — organized, honest, proactive disclosure — is the competitive advantage the module emphasizes. Technical perfection is less important than technical self-awareness and documentation discipline.

Incorrect. The module emphasizes that preparation and honest disclosure — not perfect technology — determine TDD success. Knowing your own failure modes and documenting them proactively is the key.