Module 5 · Lesson 1

What Tech-Stack Fingerprinting Actually Reveals

Headers, banners, error pages: every byte a target leaves behind is a confession of what it runs.
How does passive observation of HTTP responses, DNS records, and job postings expose the full technology stack of an organization before a single exploit is attempted?

In March 2017, the Apache Software Foundation disclosed CVE-2017-5638, a critical remote code execution flaw in Apache Struts 2. Equifax's internal systems ran a vulnerable version. The company's public-facing dispute portal leaked its Struts version in error-page stack traces, a textbook fingerprinting artifact. Attackers identified the exposure, exploited it within weeks, and exfiltrated records on 147 million people. The version string in the error response was, effectively, an invitation.

Why Tech-Stack Fingerprinting Matters

Technology fingerprinting is the process of identifying the software, frameworks, libraries, server platforms, and cloud services that a target organization uses, without requiring any privileged access. It is one of the highest-leverage activities in the reconnaissance phase because it lets an attacker go straight to known vulnerabilities in identified components instead of having to discover new ones.

A defender's perspective is equally important: understanding what an adversary can infer about your stack from public signals is the first step in reducing that exposure. AI tools have transformed this discipline, accelerating the correlation of dozens of signals that a human analyst would previously spend hours aggregating.

The Four Primary Signal Sources

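Tech-stack fingerprinting draws on four broad categories of observable data. Each can be gathered passively: nothing is sent to the target beyond the ordinary requests any browser would make, and several sources (DNS, certificate logs, job postings) never touch the target at all.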

Signal Source | What It Reveals | Example Artifact
HTTP Response Headers | Web server type, framework hints, CDN identity, cookie naming conventions | X-Powered-By: PHP/7.4.3, Server: nginx/1.18.0
JavaScript / HTML Source | Frontend frameworks, analytics platforms, CMS platform, third-party integrations | React bundle hashes, WordPress admin paths, Shopify checkout scripts
DNS & Certificate Records | Cloud provider (AWS, GCP, Azure), CDN, mail infrastructure, subdomains pointing to SaaS tools | CNAME to *.cloudfront.net, certificate SANs listing staging hosts
Job Postings & GitHub Repos | Internal tooling, preferred language versions, IaC tools, CI/CD pipelines | "Must have 3+ years Apache Kafka on EKS" reveals container orchestration stack
Key Insight

Job postings are among the most underestimated fingerprinting signals. When LinkedIn shows a company hiring "Senior Confluence/Jira Administrator," that is public disclosure of internal tooling. When a GitHub org's public repos use a specific Terraform provider version, that pins the cloud platform and approximate deployment age.

Core Vocabulary

Banner Grabbing: Reading the version string a service voluntarily announces in its connection response. Classic example: SSH responding with SSH-2.0-OpenSSH_7.2p2.
Passive Fingerprinting: Inferring stack components from publicly accessible data without sending any probes or requests to the target beyond normal HTTP browsing.
Version Correlation: Matching an identified version string against CVE databases (NVD, MITRE) to enumerate known vulnerabilities applicable to that exact release.
Technology Matrix: The structured output of a fingerprinting exercise: a map of every identified component, its version where known, confidence level, and associated CVEs.
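
The vocabulary above can be made concrete with a small sketch. The Component record and the parse_ssh_banner helper below are hypothetical names chosen for illustration, not part of any tool named in this lesson; the sketch only shows how a banner such as SSH-2.0-OpenSSH_7.2p2 might become one row of a technology matrix.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    """One row of a technology matrix (field names are illustrative)."""
    name: str
    version: Optional[str] = None
    confidence: str = "low"          # "low", "medium", or "high"
    source: str = ""                 # where the signal was observed
    cves: List[str] = field(default_factory=list)


# SSH identification strings look like: SSH-<protocol>-<product>_<version>
SSH_BANNER = re.compile(r"^SSH-(?P<proto>[\d.]+)-(?P<product>[A-Za-z]+)_(?P<version>\S+)")


def parse_ssh_banner(banner: str) -> Optional[Component]:
    """Turn an SSH banner into a matrix row, or None if it does not parse."""
    match = SSH_BANNER.match(banner.strip())
    if match is None:
        return None
    return Component(
        name=match["product"],
        version=match["version"],
        confidence="high",           # the service announced this itself
        source="ssh banner",
    )


row = parse_ssh_banner("SSH-2.0-OpenSSH_7.2p2")
print(row)
```

The cves list is left empty on purpose: version correlation is a separate step that should be filled from a live vulnerability database.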

How AI Amplifies This Process

Pre-AI, a skilled analyst building a technology matrix for a mid-size target might spend four to six hours correlating HTTP headers, Shodan data, GitHub activity, job postings, and Wappalyzer output into a coherent picture. AI-assisted workflows compress this to minutes by performing three tasks in parallel:

1. Signal aggregation: An AI model can be fed raw curl output, DNS zone data, and a paste of job requirements simultaneously and return a structured component list with confidence scores.

2. Version disambiguation: When headers report a generic version that maps to multiple CVEs across multiple branches, AI can reason about which branch is likely given other contextual signals (e.g., the Ubuntu LTS release cadence implied by the kernel version in an error page).

3. Gap identification: AI identifies which components it could not fingerprint and suggests what additional passive signals would resolve the uncertainty, e.g., "Check the SAN entries on the wildcard cert for any staging subdomains that might expose version-specific error pages."

Documented Reference

The 2020 SolarWinds supply-chain compromise involved attackers who had detailed knowledge of the target's build pipeline before initial access. Post-incident analysis by Mandiant (FireEye) found the attackers had spent months in passive reconnaissance, including reading public SolarWinds developer documentation and GitHub commit history to understand the exact build toolchain. It is a real-world demonstration of how open-source signals map a target's internal architecture.

Ethical and Legal Boundaries

All techniques in this module operate strictly on publicly available data: information the target has voluntarily published, or that arises as an incidental side effect of operating a public service. No exploitation, no unauthorized probing, no crawling beyond what a standard browser would perform. In authorized penetration testing contexts, fingerprinting typically falls under the "reconnaissance" phase explicitly scoped in the rules of engagement. Always confirm written authorization before applying these techniques against any system you do not own.

Lesson 1 Quiz

What Tech-Stack Fingerprinting Actually Reveals
1. In the 2017 Equifax breach, which specific artifact allowed attackers to identify the vulnerable software version before exploiting it?
Correct. The error-page stack traces in Equifax's dispute portal disclosed the Apache Struts version, giving attackers the exact CVE-2017-5638 target confirmation they needed.
Not quite. The specific fingerprinting artifact was the version string leaked in HTTP error-page stack traces from the dispute portal.
2. Which of the following is the best example of "passive fingerprinting"?
Correct. Reading response headers from a normal HTTP request is purely passive: no anomalous probe is sent, and the server is responding exactly as it would to any browser.
Incorrect. Nmap SYN scans, crafted payloads, and Metasploit auxiliary modules all involve active probing. Reading a standard response header is the passive approach.
3. Why are job postings considered a high-value fingerprinting signal?
Correct. Job requirements like "3+ years Kafka on EKS" or "experience with Confluence Data Center" directly disclose internal tooling, versions, and architectural choices.
Incorrect. Job postings reveal the required skills and tools, effectively publishing which software the organization uses, but do not contain network topology or license keys.

Lab 1: Fingerprinting Signal Analysis

Use the AI assistant to interpret raw HTTP headers and identify tech-stack components

Scenario

You have captured the following raw HTTP response headers from a target's public web application during a scoped engagement. Feed them to the AI assistant and work through identifying every inferrable stack component, its confidence level, and any associated CVE surface.

Paste this into the chat:

Server: Apache/2.4.49 (Unix)
X-Powered-By: PHP/7.3.27
Set-Cookie: PHPSESSID=abc123; path=/; HttpOnly
X-Generator: Drupal 9 (https://www.drupal.org)
Via: 1.1 varnish
X-Varnish: 12345678
CF-Ray: 7d4f2a1b8-LHR

Ask: "What can I infer about this target's technology stack from these headers, and what CVEs should I investigate?"
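
Before handing the headers to an assistant, it can help to see how much a few lines of code recover on their own. The following is a minimal sketch, assuming the scenario headers above; parse_headers and infer_components are hypothetical helper names, and the inference rules are deliberately simple (a version regex plus a few presence checks).

```python
import re

RAW_HEADERS = """\
Server: Apache/2.4.49 (Unix)
X-Powered-By: PHP/7.3.27
Set-Cookie: PHPSESSID=abc123; path=/; HttpOnly
X-Generator: Drupal 9 (https://www.drupal.org)
Via: 1.1 varnish
X-Varnish: 12345678
CF-Ray: 7d4f2a1b8-LHR
"""

# Matches "Product/1.2.3" or "Product 9" at the start of a header value.
PRODUCT_VERSION = re.compile(r"([A-Za-z][\w.-]*)[/ ](\d+(?:\.\d+)*)")


def parse_headers(raw):
    """Split raw 'Name: value' lines into a dict keyed by lowercase name."""
    headers = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def infer_components(headers):
    """Map component name -> version string, or None when only presence is known."""
    found = {}
    # Headers that usually carry an explicit product and version.
    for key in ("server", "x-powered-by", "x-generator"):
        match = PRODUCT_VERSION.match(headers.get(key, ""))
        if match:
            found[match.group(1)] = match.group(2)
    # Headers whose mere presence identifies a component.
    if "cf-ray" in headers:
        found["Cloudflare"] = None
    if "x-varnish" in headers or "varnish" in headers.get("via", "").lower():
        found["Varnish"] = None
    if "PHPSESSID" in headers.get("set-cookie", ""):
        found.setdefault("PHP", None)   # keep a version if one was already seen
    return found


print(infer_components(parse_headers(RAW_HEADERS)))
```

Components recorded with None are identified by presence only, which is exactly the kind of gap worth raising in the follow-up questions.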
Module 5 · Lesson 2

DNS, TLS Certificates, and Cloud Provider Enumeration

Certificate transparency logs were designed to protect users; they also hand researchers a complete map of every subdomain an organization operates.
How do certificate transparency logs, CNAME records, and TLS Subject Alternative Names reveal cloud infrastructure, internal hostnames, and staging environments without a single active probe?

In 2018, security researcher Hanno Böck documented how certificate transparency (CT) logs, the public, append-only record of every TLS certificate issued, exposed thousands of internal and staging hostnames for major organizations including banks, government agencies, and Fortune 500 companies. Because CT logs are a mandatory part of the modern web PKI, organizations inadvertently published the existence of hostnames like internal-vpn.corp.example.com and staging-api-v2.example.com simply by obtaining TLS certificates for them. Attackers with access to crt.sh or Facebook's own CT monitoring service could enumerate these hosts passively and in real time.

Certificate Transparency Logs

Since 2018, Chrome has required all publicly trusted TLS certificates to be logged in CT logs before they are considered valid. This means every certificate issued for a domain, including internal staging hosts, VPN endpoints, and development servers that briefly obtained a Let's Encrypt certificate, is permanently and publicly recorded.

The Subject Alternative Name (SAN) field of a certificate lists every hostname that certificate covers. A wildcard entry such as *.example.com hides individual names on its own, but organizations frequently pair it with explicit SANs or request separate per-host certificates, so a CT search for example.com might surface api.example.com, auth.example.com, gitlab.example.com, jenkins.example.com, and vault.example.com. Each is a potential reconnaissance target that reveals the internal tooling landscape.
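
As a rough illustration of SAN enumeration, the sketch below extracts unique hostnames from a CT search export. It assumes the export is a JSON array whose entries carry a name_value field with newline-separated names, which is the shape crt.sh JSON output is assumed to have here; the sample data, the hint list, and both function names are hypothetical.

```python
import json

# Inline sample standing in for a CT search export (no network access needed).
SAMPLE_CT_JSON = """[
  {"common_name": "*.example.com", "name_value": "*.example.com\\napi.example.com"},
  {"common_name": "jenkins.example.com", "name_value": "jenkins.example.com"},
  {"common_name": "vault.example.com", "name_value": "vault.example.com\\nAPI.example.com"}
]"""

# Substrings that suggest internal tooling or non-production hosts (illustrative).
TOOLING_HINTS = ("jenkins", "gitlab", "vault", "staging", "vpn")


def hostnames_from_ct(raw_json):
    """Return the sorted, de-duplicated, lowercased hostnames in the export."""
    names = set()
    for entry in json.loads(raw_json):
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name:
                names.add(name)
    return sorted(names)


def flag_interesting(names):
    """Keep only hostnames containing one of the tooling hints."""
    return [n for n in names if any(hint in n for hint in TOOLING_HINTS)]


hosts = hostnames_from_ct(SAMPLE_CT_JSON)
print(hosts)
print(flag_interesting(hosts))
```

Lowercasing before de-duplication matters because the same hostname often appears with different casing across certificates.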

CT / DNS Signal | What It Exposes | Tool to Query
SAN Enumeration | All hostnames on a certificate; staging, internal, and API subdomains | crt.sh, Censys
CNAME Chains | CDN provider (Cloudflare, Fastly, CloudFront), SaaS tools (HubSpot, Zendesk) | dig, MXToolbox
MX Records | Email provider (Google Workspace vs Exchange Online vs self-hosted Postfix) | dig MX, SecurityTrails
NS Records | DNS hosting provider (Route53, Cloudflare DNS, self-managed BIND) | dig NS, dnsx
SPF / DMARC TXT Records | All authorized email senders, often listing Salesforce, Mailchimp, and internal SMTP relays | dmarcian, MXToolbox
Operational Note

An SPF record alone can reveal a company's entire marketing and transactional email stack. A record like v=spf1 include:_spf.google.com include:sendgrid.net include:spf.protection.outlook.com ~all tells you the organization uses Google Workspace for internal email, SendGrid for transactional mail, and Microsoft 365 for something, potentially a subsidiary or a migrating environment. Each of these services has its own known vulnerability and phishing surface.
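
The reading of that SPF record can be sketched in a few lines. The provider table below contains only the three include domains named in the note; the function names are hypothetical, and a real lookup table would need to be far larger.

```python
# Include domain -> service, limited to the examples discussed above.
SPF_PROVIDERS = {
    "_spf.google.com": "Google Workspace",
    "sendgrid.net": "SendGrid",
    "spf.protection.outlook.com": "Microsoft 365",
}


def spf_includes(record):
    """Return the domains named in include: mechanisms, in record order."""
    return [
        token.split(":", 1)[1]
        for token in record.split()
        if token.lower().startswith("include:")
    ]


def spf_providers(record):
    """Map each include domain to a known provider, or 'unknown'."""
    return {d: SPF_PROVIDERS.get(d.lower(), "unknown") for d in spf_includes(record)}


record = (
    "v=spf1 include:_spf.google.com include:sendgrid.net "
    "include:spf.protection.outlook.com ~all"
)
print(spf_providers(record))
```

Domains that come back as "unknown" are worth a manual look, since they may be internal relays or less common SaaS senders.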

Cloud Provider Fingerprinting via DNS

CNAME records resolve to canonical hostnames that are typically provider-specific. Once you know the cloud provider, you can narrow the vulnerability surface to that provider's known misconfigurations and service-specific CVEs. Common mappings:

*.cloudfront.net: AWS CloudFront CDN. Implies an S3 origin bucket or ALB backend; check for S3 bucket misconfiguration and CloudFront origin exposure.
*.azurewebsites.net: Azure App Service. Reveals Azure tenancy; check for subdomain takeover if the CNAME exists but the app is deprovisioned.
*.storage.googleapis.com: Google Cloud Storage. Check for public bucket misconfiguration via authenticated or unauthenticated listing.
*.elb.amazonaws.com: AWS Elastic Load Balancer. Reveals a load-balanced EC2 or ECS backend; error pages and headers served through the load balancer may expose backend version strings.
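
The mappings above amount to a suffix lookup. A minimal sketch, assuming only the four suffixes listed in this lesson (provider_for is a hypothetical helper name):

```python
# CNAME target suffix -> provider, taken from the list above.
CNAME_SUFFIXES = {
    ".cloudfront.net": "AWS CloudFront",
    ".azurewebsites.net": "Azure App Service",
    ".storage.googleapis.com": "Google Cloud Storage",
    ".elb.amazonaws.com": "AWS Elastic Load Balancer",
}


def provider_for(cname):
    """Return the provider implied by a CNAME target, or None if unrecognized."""
    target = cname.rstrip(".").lower()   # tolerate trailing dot and mixed case
    for suffix, provider in CNAME_SUFFIXES.items():
        if target.endswith(suffix):
            return provider
    return None


for target in ("d1abc.cloudfront.net.", "example-corp.azurewebsites.net", "www.example.com"):
    print(target, "->", provider_for(target))
```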

Subdomain Takeover: When Fingerprinting Finds Abandoned Infrastructure

One of the most exploitable outcomes of CT log and DNS enumeration is identifying dangling DNS records: CNAME entries pointing to cloud resources that have been deprovisioned. In 2021, security researcher Frans Rosén documented dozens of Fortune 500 subdomain takeovers including at Starbucks, where a subdomain pointed to a deprovisioned Azure Web App. Anyone who registered that Azure Web App name could host content under the Starbucks subdomain, enabling highly convincing phishing.

AI assistants accelerate this analysis by rapidly correlating a list of resolved CNAMEs against a known list of takeover-vulnerable providers and flagging any that return NXDOMAIN or specific "not found" responses associated with unclaimed resources.
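
The correlation described here is simple enough to sketch without an assistant. The example assumes the CNAME targets and their resolution status were already collected during an authorized engagement; the ResolvedRecord type, the status labels, and the one-entry suffix list are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResolvedRecord:
    host: str
    cname: str
    status: str   # assumed labels: "ok", "nxdomain", or "not_found"


# Providers treated as takeover-prone in this sketch (only the one discussed above).
TAKEOVER_PRONE_SUFFIXES = (".azurewebsites.net",)
UNCLAIMED_STATUSES = {"nxdomain", "not_found"}


def takeover_candidates(records):
    """Return hosts whose CNAME targets a listed provider and looks unclaimed."""
    flagged = []
    for rec in records:
        target = rec.cname.rstrip(".").lower()
        if target.endswith(TAKEOVER_PRONE_SUFFIXES) and rec.status in UNCLAIMED_STATUSES:
            flagged.append(rec.host)
    return flagged


records = [
    ResolvedRecord("gitlab.example-corp.com", "example-corp.azurewebsites.net", "not_found"),
    ResolvedRecord("payments.example-corp.com", "d1abc.cloudfront.net", "ok"),
]
print(takeover_candidates(records))
```

A flagged host is a candidate for manual confirmation and reporting, not proof of a takeover.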

AI Workflow Pattern

Feed a bulk export from crt.sh (JSON) to an AI model with the prompt: "Identify all subdomains that suggest internal tooling (CI/CD, secrets management, monitoring), all that suggest staging or development environments, and all CNAME targets that belong to providers known to be vulnerable to subdomain takeover." The model can process hundreds of records in seconds and return a structured priority list.

Lesson 2 Quiz

DNS, TLS Certificates, and Cloud Provider Enumeration
1. Certificate Transparency logs expose subdomain hostnames because:
Correct. Chrome's CT requirement (mandatory since 2018) means every certificate, including those issued for staging, internal, and VPN endpoints, must be logged publicly, exposing all hostnames in the SAN field.
Incorrect. CT logs exist because browsers like Chrome require certificates to be logged before they are trusted. This inadvertently publishes every hostname that obtains a TLS certificate, including internal and staging systems.
2. A DNS lookup shows: payments.example.com CNAME example.azurewebsites.net, but the Azure Web App returns "Error 404: Web App Not Found." What is the likely security implication?
Correct. A dangling CNAME to a deprovisioned Azure Web App is the classic subdomain takeover scenario. Anyone who registers that Azure Web App name now controls what is served at payments.example.com, enabling highly credible phishing or credential harvesting.
Incorrect. When a CNAME points to a cloud resource that no longer exists, an attacker can claim that resource name and control what is served under the original domain. This is a subdomain takeover vulnerability.
3. What does an SPF record containing include:sendgrid.net include:_spf.google.com reveal to a fingerprinting analyst?
Correct. SPF include directives are a direct enumeration of every third-party service authorized to send email on the domain's behalf, revealing the SaaS and communication stack in one DNS record.
Incorrect. SPF include statements name every service authorized to send email on the organization's behalf. This is a direct disclosure of their email service stack, which is highly useful for fingerprinting the broader SaaS environment.

Lab 2: CT Log & DNS Stack Mapping

Parse certificate and DNS data with AI to build a cloud infrastructure map

Scenario

You have queried crt.sh for example-corp.com and gathered DNS records. The data below is the kind of output you'd receive. Work with the AI to extract maximum intelligence about the target's cloud and SaaS footprint.

Share this data with the AI:

CT Log SANs found: gitlab.example-corp.com, jenkins.example-corp.com, vault.example-corp.com, staging-api.example-corp.com, vpn.example-corp.com, payments.example-corp.com

DNS findings:
gitlab.example-corp.com → CNAME → example-corp.azurewebsites.net (404 Not Found)
payments.example-corp.com → CNAME → d1abc.cloudfront.net
MX → aspmx.l.google.com (priority 1)
SPF → include:sendgrid.net include:_spf.google.com include:spf.mandrillapp.com

Ask: "Build me a technology matrix from this data, flag any subdomain takeover risks, and identify what each component tells us about the organization's security posture."
Module 5 · Lesson 3

Wappalyzer, Shodan, and AI-Assisted Version Correlation

Automated scanners collect the raw data. AI provides the analytical layer that turns a list of versions into a prioritized vulnerability roadmap.
How do tools like Wappalyzer and Shodan feed into AI-assisted workflows that automatically correlate detected versions with CVE databases and rank exploitability?

In May 2023, Progress Software's MOVEit Transfer product was found to contain a critical SQL injection vulnerability, CVE-2023-34362. Before the patch was issued, the Cl0p ransomware group had already conducted extensive passive reconnaissance using Shodan and Censys to enumerate every internet-facing MOVEit installation, approximately 2,500 organizations globally. The product identifies itself via a distinctive login page and HTTP header pattern. Cl0p's readiness to exploit immediately on disclosure demonstrated that the version correlation and targeting had been completed weeks in advance, entirely through passive fingerprinting of Shodan data.

Wappalyzer: Browser-Level Stack Detection

Wappalyzer is a technology profiler, originally released as open source, that identifies web technologies by matching HTTP headers, HTML patterns, JavaScript variable names, cookie names, and meta tags against a continuously updated database of signatures. Its ruleset covers over 3,000 technologies across categories including CMS platforms, JavaScript frameworks, analytics tools, CDNs, payment processors, and server software.

In an OSINT context, Wappalyzer can be used via browser extension against any public site, or the underlying data can be queried programmatically. The key output is a structured list of detected technologies with version numbers where extractable β€” exactly the input an AI model needs for CVE correlation.

Wappalyzer Category | Example Detection | Fingerprinting Signal Used
CMS | WordPress 6.2.1 | Generator meta tag, admin URL pattern, script handles
JavaScript Framework | React 18.2.0 | window.__REACT_DEVTOOLS_GLOBAL_HOOK__, bundle filename hash patterns
Analytics | Google Analytics 4 | gtag.js with GA4 measurement ID format
Web Server | nginx 1.24.0 | Server response header
Payment Processor | Stripe | js.stripe.com script inclusion, Stripe-specific CSP directives
CDN | Cloudflare | CF-Ray header, legacy __cfduid cookie pattern (the cookie was retired in 2021)

Shodan: Internet-Wide Service Enumeration

Shodan continuously crawls the entire IPv4 address space, storing banner data from open ports: HTTP, HTTPS, SSH, FTP, RDP, industrial protocols, and hundreds of others. Unlike Wappalyzer, which operates at the application layer of a specific URL, Shodan provides infrastructure-level visibility: what services are listening on which ports, what version strings they announce, what TLS certificates they present, and what geographic/ASN data surrounds them.

For tech-stack fingerprinting, Shodan is most valuable for discovering non-HTTP services that application-layer tools miss: exposed databases (MongoDB, Elasticsearch, Redis with no authentication), industrial control systems, VPN appliances, and network devices that expose management interfaces to the public internet.

Real-World Signal: Shodan Search

A Shodan query for org:"Target Corp" product:"Elasticsearch" returns every Elasticsearch instance Shodan has indexed on IP ranges registered to that organization, including version, cluster name, and whether authentication is enabled. In 2017 and 2018, thousands of MongoDB and Elasticsearch instances were found completely open, with their contents deleted and replaced with ransom notes. The attackers used exactly this Shodan methodology to build target lists.

The AI CVE Correlation Workflow

Raw Wappalyzer and Shodan output gives you a list of components and versions. The next step, mapping each version to its CVE surface, is where AI provides the highest leverage. The workflow:

Step 1 (Ingest): Feed Wappalyzer JSON export and Shodan host report to the AI model as raw data.
Step 2 (Normalize): AI extracts a clean component:version list, resolving ambiguities (e.g., distinguishing "WordPress 6.2" from "WordPress 6.2.1"; the minor version matters for specific CVEs).
Step 3 (Correlate): For each component:version, AI maps known CVEs (from NVD training data), their CVSS scores, and whether public exploits exist.
Step 4 (Prioritize): AI ranks findings by exploitability (CVSS v3 base score × exploit availability × exposure level) and produces an ordered attack surface summary.
Step 5 (Gap Flag): AI explicitly identifies what could not be versioned and what additional passive signals would resolve the uncertainty.
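
The Prioritize step can be sketched as a plain scoring function. The weights below (1.0 or 0.5 for exploit availability, 1.0 or 0.4 for exposure) are assumptions picked for illustration, since the lesson names the factors but not their values, and the findings use placeholder identifiers with made-up scores, not real CVE data.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    component: str
    version: str
    cve_id: str            # placeholder IDs only; not real CVE records
    cvss: float            # CVSS v3 base score, 0.0 to 10.0
    exploit_public: bool   # is a public exploit known?
    internet_facing: bool  # is the service reachable from the internet?


def priority(finding):
    """CVSS base score x exploit availability x exposure level (assumed weights)."""
    exploit_weight = 1.0 if finding.exploit_public else 0.5
    exposure_weight = 1.0 if finding.internet_facing else 0.4
    return round(finding.cvss * exploit_weight * exposure_weight, 2)


def rank(findings):
    """Order findings from highest to lowest priority."""
    return sorted(findings, key=priority, reverse=True)


findings = [
    Finding("web-server", "1.0", "EXAMPLE-0001", 9.8, False, True),
    Finding("cms", "2.0", "EXAMPLE-0002", 7.5, True, True),
    Finding("database", "3.0", "EXAMPLE-0003", 9.0, True, False),
]
for f in rank(findings):
    print(f.cve_id, priority(f))
```

With these weights a 7.5 with a public exploit on an exposed service outranks a 9.8 with no known exploit, which is the behavior the multiplicative formula is meant to produce.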

Limitations and Accuracy Considerations

AI CVE correlation from fingerprinted versions has two key limitations that practitioners must understand. First, version strings can lag behind patches: vendors sometimes backport security fixes without updating the version string (particularly common in enterprise Linux distributions like RHEL and CentOS). A server reporting Apache 2.4.6 on RHEL 7 may have had dozens of CVE fixes backported by Red Hat while the version string remains static.

Second, AI training data on CVEs has a cutoff date. For very recent vulnerabilities (within the past few months), the model may not have complete knowledge. Always cross-reference AI output with a live NVD query or a tool like trivy or grype for production decisions.
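
One way to keep the backporting caveat visible is to carry it into the matrix as a lower confidence level. The sketch below assumes a Server header of the form Product/version (Platform); the distro list covers only the distributions named above, and the function name and confidence labels are hypothetical.

```python
import re

# Platforms named in this lesson as backporting fixes without changing versions.
BACKPORTING_DISTROS = ("red hat", "rhel", "centos")

SERVER_HEADER = re.compile(
    r"(?P<product>[A-Za-z-]+)/(?P<version>[\d.]+)(?:\s+\((?P<platform>[^)]*)\))?"
)


def version_confidence(server_header):
    """Parse a Server header and mark versions that may hide backported fixes."""
    match = SERVER_HEADER.match(server_header)
    if match is None:
        return None
    platform = (match["platform"] or "").lower()
    backport_possible = any(distro in platform for distro in BACKPORTING_DISTROS)
    return {
        "product": match["product"],
        "version": match["version"],
        "backport_possible": backport_possible,
        # A frozen distro version says little about the actual patch level.
        "confidence": "low" if backport_possible else "medium",
    }


print(version_confidence("Apache/2.4.6 (CentOS)"))
print(version_confidence("Apache/2.4.49 (Unix)"))
```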

Documented Reference

The 2023 Barracuda Email Security Gateway compromise (CVE-2023-2868) followed a similar pre-disclosure reconnaissance pattern to MOVEit. Shodan data showed that ESG appliances expose a distinctive web UI with a version-specific login page. Mandiant's post-incident analysis confirmed that the attackers had precise knowledge of which ESG versions were vulnerable and targeted those specifically, which is consistent with prior passive fingerprinting of Shodan data.

Lesson 3 Quiz

Wappalyzer, Shodan, and AI-Assisted Version Correlation
1. How did the Cl0p group prepare to exploit CVE-2023-34362 (MOVEit) before the patch was even released?
Correct. MOVEit Transfer has a distinctive login page and HTTP header fingerprint. Cl0p used Shodan/Censys to build a complete list of approximately 2,500 internet-facing installations before the vulnerability was publicly disclosed, enabling immediate mass exploitation on day zero.
Incorrect. The documented approach was passive fingerprinting via Shodan and Censys: the distinctive MOVEit HTTP fingerprint allowed them to enumerate all ~2,500 internet-facing installations before the patch existed.
2. Why might a server reporting "Apache 2.4.6" on RHEL 7 NOT actually be vulnerable to Apache 2.4.6-era CVEs?
Correct. Enterprise Linux distributions like RHEL, CentOS, and their derivatives routinely backport upstream security patches into packages while freezing the version number for stability and compatibility. This is a critical caveat when doing version-based CVE correlation.
Incorrect. The key phenomenon is Red Hat's backporting practice: security patches from newer Apache versions are applied to the RHEL-packaged 2.4.6, but the version string stays the same. This means version-string-based CVE correlation can overestimate the actual vulnerability surface.
3. In the AI CVE correlation workflow, what is the purpose of the "Gap Flag" step?
Correct. The Gap Flag step makes the analysis actionable by explicitly naming what remains unknown and guiding the analyst toward specific additional signals, for example suggesting a CT log query to find version-specific staging subdomains when the production version could not be determined.
Incorrect. The Gap Flag step identifies which components couldn't be versioned in the current dataset and recommends specific additional passive signals that could resolve those uncertainties, making the reconnaissance workflow iterative and thorough.

Lab 3: AI-Assisted CVE Correlation

Feed Wappalyzer and Shodan output to the AI and build a prioritized vulnerability roadmap

Scenario

You have run Wappalyzer against a target's public-facing portal and obtained a Shodan host report for their primary IP. Use the AI to perform the full five-step CVE correlation workflow and produce a prioritized attack surface summary.

Share this data with the AI:

Wappalyzer output:
- WordPress 5.9.3
- PHP 7.4.3
- Apache 2.4.49
- jQuery 1.12.4
- WooCommerce 6.0.0

Shodan host report (same IP):
- Port 22: OpenSSH 7.4 (protocol 2.0)
- Port 3306: MySQL 5.7.38
- Port 6379: Redis 5.0.14 (no authentication required)
- Port 8080: Apache Tomcat 9.0.45

Ask: "Run the full CVE correlation workflow on this data. Normalize the component list, correlate CVEs for each version, rank by exploitability, and flag any critical issues."
Module 5 · Lesson 4

GitHub, Job Postings, and Supply-Chain Stack Inference

The most detailed technology disclosure documents an organization publishes are its job advertisements and its developers' open-source contributions.
How do public GitHub repositories, developer profiles, CI/CD configuration files, and job postings combine to expose an organization's complete internal technology stack, including components that produce no network-observable signals?

The Nobelium group (SVR-attributed) that executed the SolarWinds supply chain attack conducted extensive pre-intrusion reconnaissance of SolarWinds' development infrastructure. Post-incident analysis by CrowdStrike and Mandiant found evidence that the attackers had read publicly available SolarWinds developer documentation, studied GitHub repositories under the SolarWindsInc organization, and analyzed conference presentation slides by SolarWinds engineers that described the Orion build pipeline in detail. This open-source intelligence provided the blueprint for injecting the SUNBURST backdoor into the build process without triggering obvious anomalies. The attackers knew exactly what normal looked like because they had read the public documentation.

GitHub as an OSINT Gold Mine

Public GitHub repositories and organizational accounts reveal far more than source code. The following artifacts are routinely exposed and represent high-value fingerprinting data:

GitHub Artifact | What It Reveals | Example
Dependency Files | Exact library versions used in production; all transitive dependencies | package-lock.json, requirements.txt, pom.xml, go.sum
CI/CD Config Files | Build toolchain, test frameworks, deployment targets, names of cloud credential variables | .github/workflows/*.yml, Jenkinsfile, .circleci/config.yml
IaC Files | Cloud provider, region, service types, network architecture | main.tf with AWS provider and specific module versions
Docker / Compose Files | Base image (and its exact version/distro), sidecar services, database versions | FROM node:14.17.0-alpine in Dockerfile
Commit History | Previously committed secrets (even if deleted), deprecated service names, internal hostnames in config | git log searches for API keys and secret patterns
Release Tags | Software version cadence, allowing inference of current version from release frequency | Last tag 8 months ago suggests a pinned, likely outdated version
Documented Exposure Pattern

In 2022, the Lapsus$ group openly discussed in their Telegram channel that they obtained initial access to multiple targets by searching GitHub for hardcoded credentials in configuration files that developers had accidentally committed. The group specifically mentioned searching for AWS access key patterns (AKIA[0-9A-Z]{16}) in public repos. GitHub's secret scanning feature now flags these, but many organizations had years of unscanned history containing active credentials.
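
From the defender's side, the same pattern is a cheap audit of your own repositories. A minimal sketch using the access key ID pattern quoted above; the sample string uses the placeholder key ID from AWS's public documentation, and find_key_ids is a hypothetical helper name.

```python
import re

# Access key ID pattern quoted above: "AKIA" followed by 16 uppercase
# letters or digits, bounded so longer tokens are not partially matched.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_key_ids(text):
    """Return every substring of text that looks like an AWS access key ID."""
    return AWS_KEY_ID.findall(text)


sample_config = '''
region = us-east-1
aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"
note = "AKIA123 is too short to match"
'''
print(find_key_ids(sample_config))
```

A match only shows that a string has the right shape; whether the key was ever valid or is still active has to be established through the owning account.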

Job Postings as Technology Disclosure Documents

A senior engineering job posting is effectively a partial architecture document. Requirements sections directly list the technologies an organization uses, and "nice to have" sections often name future investments. Key patterns to extract:

Language Version Pins: "Python 3.8+" or "Java 11 (LTS)" pins the runtime version. Oracle's Premier Support for Java 11 ended in September 2023, so a posting that still names it may point to production systems running without current patches unless they use a vendor build with longer support.
Infrastructure Specifics: "Experience with EKS 1.24 and Karpenter" reveals the Kubernetes version, the cloud provider (AWS), and the autoscaling toolchain, all relevant to the k8s CVE surface.
Security Tool Stack: "Familiarity with Tenable.io and Qualys" tells an attacker exactly which scanner signatures the organization's detection team is working from.
Data Platform Indicators: "Experience with Snowflake, dbt, and Fivetran" identifies the data warehouse and ETL pipeline, and where sensitive data flows.
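
Extracting these patterns from posting text is mostly string matching. The sketch below assumes a small hand-picked vocabulary drawn from the examples above; the function names and both word lists are hypothetical, and a real extractor would need a much larger dictionary.

```python
import re

# Technologies that are usually followed by a version number in postings.
VERSIONED = re.compile(r"\b(Python|Java|EKS|Kubernetes)\s+(\d+(?:\.\d+)*)", re.IGNORECASE)

# Tools that are usually named without a version (from the examples above).
KNOWN_TOOLS = ("Karpenter", "Snowflake", "dbt", "Fivetran", "Tenable.io", "Qualys")


def version_pins(text):
    """Return (technology, version) pairs in the order they appear."""
    return [(m.group(1), m.group(2)) for m in VERSIONED.finditer(text)]


def named_tools(text):
    """Return the known tools mentioned in the text, case-insensitively."""
    lowered = text.lower()
    return [tool for tool in KNOWN_TOOLS if tool.lower() in lowered]


posting = (
    "Requirements: Python 3.8+, Java 11 (LTS), experience with EKS 1.24 "
    "and Karpenter. Familiarity with Tenable.io and Qualys is a plus."
)
print(version_pins(posting))
print(named_tools(posting))
```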

AI-Assisted Supply-Chain Inference

Once a technology matrix is assembled from GitHub and job posting data, AI can perform a higher-order analysis: supply-chain inference. This involves reasoning about which upstream open-source dependencies the identified stack relies on, cross-referencing those dependencies against known supply-chain compromise events (e.g., the 2021 UA-Parser-JS npm package compromise, the 2022 colors.js incident), and identifying whether the organization's dependency versions fall in affected ranges.

This analysis is particularly powerful because supply-chain vulnerabilities are often invisible to traditional perimeter scanning: the malicious code runs inside a legitimate, signed package that the organization intentionally installed. Only version-level analysis of the dependency tree reveals the exposure.
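
Version-level analysis reduces to comparing pinned versions against a list of known-bad releases. In the sketch below, the ua-parser-js versions are the ones I understand the public advisory for the 2021 incident to list; treat them as an assumption and confirm against the advisory before relying on them. The deps dictionary stands in for versions read from a lock file, and affected is a hypothetical helper name.

```python
# Package -> versions reported as compromised.
# Assumption: taken from the public advisory for the 2021 ua-parser-js
# incident; verify against the advisory before using this list.
COMPROMISED = {
    "ua-parser-js": {"0.7.29", "0.8.0", "1.0.0"},
}


def affected(deps):
    """Return 'name@version' for every pinned dependency in a compromised set."""
    return sorted(
        f"{name}@{version}"
        for name, version in deps.items()
        if version in COMPROMISED.get(name, set())
    )


# Stand-in for versions extracted from a lock file.
deps = {"ua-parser-js": "0.7.29", "axios": "0.21.1", "log4js": "2.9.0"}
print(affected(deps))
```

Exact-match sets are used here because these incidents involved specific published releases; vulnerabilities described as version ranges would need a proper version comparison instead.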

AI Workflow Pattern: GitHub Analysis

A highly effective prompt pattern: "Here is the content of this organization's public GitHub Actions workflow file. Identify every tool and action version pinned in this workflow, any deprecated or known-vulnerable action versions, any secrets or environment variables that suggest credential injection points, and what the overall CI/CD pipeline architecture reveals about their deployment target environment."
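
The first part of that prompt, listing every pinned action version, can also be done mechanically before involving a model. A rough sketch that scans workflow text with a regular expression and avoids a YAML dependency; the sample workflow and the function names are hypothetical. It separates full commit SHA pins from tag or branch refs, which can be moved by the action's maintainer.

```python
import re

# Matches "uses: owner/repo@ref" and "uses: owner/repo/path@ref".
USES = re.compile(r"uses:\s*([\w.-]+/[\w./-]+)@([\w.-]+)")
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")

SAMPLE_WORKFLOW = """\
steps:
  - uses: actions/checkout@v2
  - uses: actions/setup-node@v2
    with:
      node-version: '14'
  - uses: aws-actions/amazon-ecr-login@v1
"""


def pinned_actions(workflow_text):
    """Return (action, ref) pairs for every 'uses:' line, in file order."""
    return [(m.group(1), m.group(2)) for m in USES.finditer(workflow_text)]


def mutable_refs(actions):
    """Keep actions whose ref is a tag or branch, not a full commit SHA."""
    return [(action, ref) for action, ref in actions if not FULL_SHA.match(ref)]


actions = pinned_actions(SAMPLE_WORKFLOW)
print(actions)
print(mutable_refs(actions))
```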

Combining All Signal Sources: The Full Technology Matrix

The highest-fidelity fingerprinting comes from combining all four signal categories (HTTP headers, DNS/CT logs, automated scanner data, and GitHub/job data) into a single unified matrix. AI is most valuable at this synthesis layer: reconciling conflicting signals (a job posting mentions Kubernetes while the production IP shows only Apache, so perhaps k8s is backend-only), filling gaps by reasoning from adjacent evidence, and producing a confidence-scored final picture that a human analyst can act on without having to re-examine each raw data source.

Lesson 4 Quiz

GitHub, Job Postings, and Supply-Chain Stack Inference
1. How did the Nobelium/SVR group's use of public GitHub data contribute to the success of the SolarWinds SUNBURST attack?
Correct. Post-incident analysis showed the attackers had studied public documentation about the Orion build pipeline extensively. Knowing what "normal" looked like allowed them to inject SUNBURST in a way that mimicked legitimate build artifacts and avoided triggering alerts.
Incorrect. The documented finding was that the attackers used public documentation and GitHub repos to understand the build process deeply enough that their malicious injection could pass as legitimate. Knowing the expected behavior made evasion possible.
2. A job posting requires "3+ years experience with EKS 1.24 and Karpenter." What specific fingerprinting intelligence does this provide?
Correct. EKS 1.24 pins the Kubernetes version, which has a specific CVE surface. EKS confirms AWS as the cloud provider. Karpenter is AWS-specific, confirming single-cloud AWS deployment. Together, these allow precise narrowing to known EKS 1.24 issues and AWS-specific misconfigurations.
Incorrect. The job requirement gives you: Kubernetes version 1.24 (specific CVE surface), cloud provider (EKS = AWS, not multi-cloud), and the autoscaling tool (Karpenter, which has its own IAM and RBAC implications). This is precise technical intelligence.
3. Why is supply-chain vulnerability analysis particularly difficult to detect with traditional perimeter security scanning?
Correct. A compromised npm package or Python library passes all signature checks and network inspection because it IS a legitimate, signed artifact, just with malicious code injected. Perimeter scanners have no visibility into whether a specific package version has been tampered with; only version-level dependency analysis reveals the exposure.
Incorrect. The fundamental reason supply-chain attacks evade perimeter scanning is that the malicious payload is inside a legitimately signed, intentionally installed package. There's nothing anomalous at the network layer; detection requires analyzing which specific version of each dependency is installed and whether it falls in a compromised range.

Lab 4: GitHub & Job Posting Stack Inference

Extract a complete technology matrix from open-source developer signals using AI analysis

Scenario

You have gathered GitHub CI/CD config and job posting data for a target organization. Use the AI to synthesize all signals into a unified technology matrix with supply-chain risk assessment.

Share this data with the AI:

GitHub Actions workflow (.github/workflows/deploy.yml):
- uses: actions/checkout@v2
- uses: actions/setup-node@v2 with node-version: '14'
- run: npm ci (package-lock.json shows log4js@2.9.0, axios@0.21.1)
- uses: aws-actions/amazon-ecr-login@v1
- Docker build FROM node:14-alpine
- Deploy to ECS via aws-actions/amazon-ecs-deploy-task-definition@v1

Job posting requirements:
"Experience with Node.js 14 LTS, Express.js, PostgreSQL 12 on RDS, Redis 5.x on ElastiCache, Terraform 0.14, Datadog APM, PagerDuty integration. Familiarity with OWASP ZAP for security testing."

Ask: "Build a complete technology matrix from this GitHub and job posting data. Identify any CVE-relevant version pins, supply-chain risks in the listed dependencies, and what the OWASP ZAP mention tells us about their security testing coverage."
AI Lab Assistant
GitHub & Supply-Chain Analysis
Welcome to Lab 4. I'll help you synthesize GitHub CI/CD data and job posting signals into a comprehensive technology matrix, including CVE-relevant version pins, supply-chain risks, and what the security tooling choices reveal about their detection coverage. Paste the scenario data and let's build the full picture.

Module 5: Test

Tech-Stack Fingerprinting with AI · 15 questions · Pass at 80%
1. Which HTTP response header directly reveals the server-side scripting language and version?
Correct. X-Powered-By commonly contains values like "PHP/7.4.3" or "ASP.NET" and is one of the most direct version disclosure headers.
Incorrect. X-Powered-By is the header that typically discloses the server-side language and version.
2. Certificate Transparency logs were created primarily to:
Correct. CT logs were designed to catch mis-issuance by certificate authorities. The reconnaissance utility is an unintended but significant side effect of mandatory public logging.
Incorrect. CT logs exist to detect fraudulent or mistakenly issued certificates. Their value for OSINT fingerprinting is an unintended consequence of the mandatory public logging requirement.
3. A CNAME record for blog.target.com resolves to target.wpengine.com. What does this fingerprinting signal confirm?
Correct. A CNAME to *.wpengine.com is definitive evidence of WP Engine managed WordPress hosting. This also implies WordPress as the CMS and WP Engine's specific CDN and security stack.
Incorrect. A CNAME resolving to *.wpengine.com definitively identifies WP Engine as the managed WordPress host. WP Engine actively maintains its DNS, so this is not a takeover scenario; it's an active, claimed resource.
4. Shodan differs from Wappalyzer primarily because:
Correct. Shodan scans the entire internet across all ports, collecting banner data from services like SSH, databases, industrial control systems, and VPNs. Wappalyzer focuses on identifying web technologies at the HTTP application layer for specific URLs.
Incorrect. The key difference is scope: Shodan provides infrastructure-level, multi-protocol visibility across the full IP address space, while Wappalyzer works at the HTTP application layer of specific web URLs.
5. Which artifact in a public Docker Hub image reveals the exact base OS distribution and version of a containerized application?
Correct. The FROM instruction (e.g., FROM ubuntu:20.04 or FROM node:14.17.0-alpine) directly names the base image, pinning the OS distribution and version, and therefore the OS-level CVE surface.
Incorrect. The FROM instruction in the Dockerfile names the exact base image. If the Dockerfile is public (or the image history is readable), this directly discloses the OS and its version.
6. What makes a "dangling CNAME" a security vulnerability?
Correct. A dangling CNAME pointing to a deprovisioned Azure Web App, S3 bucket, or Heroku app allows anyone who claims that resource name on the cloud provider to serve arbitrary content under the original organization's trusted domain.
Incorrect. The vulnerability is that an attacker can register the now-unclaimed cloud resource the CNAME points to, and then serve content (malware, phishing pages, fake login forms) under the original organization's trusted hostname.
7. An organization's SPF record includes include:spf.mandrillapp.com. What does this reveal?
Correct. SPF include directives authorize the named service to send email as the domain. Mandrill (now Mailchimp Transactional) is used for programmatic/transactional email sending; its inclusion reveals this SaaS dependency.
Incorrect. SPF include entries authorize a service to SEND email on the domain's behalf. Mandrill is Mailchimp's transactional email API; its presence means the organization sends programmatic email through Mailchimp.
8. In the Lapsus$ group's documented GitHub reconnaissance, what specific pattern did they search for to find credentials?
Correct. Lapsus$ specifically mentioned searching for the AWS access key ID prefix pattern AKIA followed by 16 alphanumeric characters, a format distinctive enough to find with regex searches across public code.
Incorrect. Lapsus$ documented searching specifically for the AWS access key pattern AKIA[0-9A-Z]{16}, which is recognizable enough to find with a simple regex across public repositories.
9. Why does Wappalyzer's detection of "WordPress 6.2.1" matter more for CVE correlation than simply detecting "WordPress 6"?
Correct. Security patches are delivered in minor and patch releases. WordPress 6.2 might have CVE-X while 6.2.1 patches it. Without the patch version, you cannot determine whether the specific fix has been applied.
Incorrect. Minor and patch version numbers matter enormously for CVE correlation because individual security fixes are delivered at that granularity. Version 6.2 and 6.2.1 may have completely different CVE exposure profiles.
10. A GitHub Actions workflow uses actions/checkout@v2. What fingerprinting intelligence does this provide beyond the action name?
Correct. Major-version tags like @v2 are mutable: they can be updated to point to new commits. Security best practice is to pin to a specific SHA. The absence of SHA-pinning tells you the organization may not follow GitHub Actions supply-chain hardening guidance.
Incorrect. The security-relevant insight is that @v2 is a mutable tag. Organizations that haven't pinned to specific SHAs are more susceptible to supply-chain attacks via compromised GitHub Actions, because the tag could be updated by an attacker who compromises the action author's account.
11. The "Technology Matrix" output of a fingerprinting exercise should include which four fields for each identified component?
Correct. A well-structured technology matrix captures component, version (which enables CVE correlation), confidence level (honest about uncertainty), and associated CVEs, which is the minimal set needed for actionable vulnerability prioritization.
Incorrect. The four key fields for an actionable technology matrix are: component name, version (where known), confidence level (acknowledging uncertainty), and associated CVEs. Together they enable direct prioritization of the vulnerability surface.
12. Red Hat's practice of backporting security patches without changing version numbers means that a server running "Apache 2.4.6 (RHEL 7)" may:
Correct. RHEL's backporting means the package version string is frozen while the actual binary receives security patches. You cannot reliably determine CVE exposure from the Apache version string alone on RHEL; you need the specific RPM package version (e.g., httpd-2.4.6-97.el7.x86_64).
Incorrect. RHEL backports patches into frozen-version packages. The Apache 2.4.6 on RHEL 7 may have dozens of upstream CVEs patched. Accurate CVE correlation requires checking the RPM package version, not just the Apache version string.
13. Which crt.sh query syntax would enumerate all subdomains of target.com from Certificate Transparency logs?
Correct. The % wildcard in crt.sh SQL-style queries matches any prefix, so %.target.com returns all certificates issued for any subdomain of target.com, surfacing internal and staging hostnames from CT logs.
Incorrect. The crt.sh wildcard syntax uses SQL-style %, so %.target.com matches all certificates issued for subdomains of target.com. This is the most effective query for subdomain enumeration via CT logs.
14. A job posting mentions "Familiarity with OWASP ZAP for security testing." What does this reveal from a fingerprinting perspective?
Correct. Knowing an organization uses ZAP (a DAST tool) means you know their scanner's detection signatures: what it will and won't catch. If ZAP is the only tool mentioned, it implies they may lack SAST (static analysis), IAST (interactive analysis), and SCA (dependency scanning), each representing coverage gaps.
Incorrect. ZAP is a specific DAST tool with known detection capabilities and blind spots. Knowing which security tool an organization uses tells you the signatures it tests against and, implicitly, the attack categories it may miss. It doesn't imply comprehensive coverage.
15. Which of the following best describes the AI's role in the "Version Disambiguation" step of tech-stack fingerprinting?
Correct. Version disambiguation uses reasoning across multiple contextual signals: if the server reports Ubuntu 20.04 LTS and Apache 2.4, AI can reason about which Apache version ships with that Ubuntu release and whether standard LTS patches have been applied, narrowing the CVE surface more accurately than a raw version string match alone.
Incorrect. Version disambiguation involves AI reasoning across contextual signals (OS type, release cadence, other detected components) to narrow which specific version branch and patch level is most probable, rather than treating every CVE for a major version as equally likely.