PDG Fingerprint v1 Specification

Document status: Specification – v0.5.0
Location: docs/reference/finding-fingerprint-v1.md


Overview

The PDG fingerprint is a deterministic, content-addressed identifier for a pydepgate scan run and its findings. It is designed to let a third party independently reproduce a researcher’s claim by running pydepgate against the same artifact and comparing hashes, without needing to trust the researcher’s output file.

The fingerprint has two components:

Finding hash. A SHA-512 digest computed over a normalized, sorted representation of the active builtin findings from a scan. Suppressed findings and custom-rule findings are excluded. The hash is stable across re-scans of the same artifact when the finding set is identical.

Run envelope. A JSON object encoding run metadata that is base64-encoded and embedded in scan output as an armored block called PDG_FINGERPRINT.

The primary distribution format for a fingerprint is the findings-string: a short, portable identifier that encodes the run ID, the pydepgate version, and the finding hash in a single pasteable string.


Findings-string

The findings-string is the canonical form of a PDG fingerprint. It is the value passed to pydepgate validate and the value generated by pydepgate cite. Format:

<run_uuid>:pydepgate<version>:<findings_hash>

Where:

  • <run_uuid> is the UUID7 of the original scan run in standard 36-character lowercase hyphenated form. Example: 550e8400-e29b-71d4-a716-446655440000.
  • pydepgate<version> is the literal string pydepgate followed by the full pydepgate version string with no separator. Example: pydepgate0.5.0. Pre-release segments are included verbatim (e.g. pydepgate0.5.0a1).
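  • <findings_hash> is the URL-safe base64 encoding (no padding) of the raw SHA-512 digest produced by the finding hash recipe defined below. A SHA-512 digest is 64 bytes, so this segment is always 86 characters long.

Example (the hash shown is that of a scan with no active builtin findings):

550e8400-e29b-71d4-a716-446655440000:pydepgate0.5.0:z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXcg_SpIdNs6c5H0NE8XYXysP-DGNKHfuwvY7kxvUdBeoGlODJ6-SfaPg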

The three segments are separated by :. Colons do not appear within any segment. The findings-string is safe to paste into terminals, embed in report headers, and pass as a CLI argument.
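The format rules above can be checked mechanically before a findings-string is parsed. The following is a minimal sketch, not part of pydepgate: the names FINDINGS_STRING_RE and looks_like_findings_string are hypothetical, and the sketch checks only the shape of each segment (lowercase hyphenated UUID with version nibble 7, the literal pydepgate prefix, and the URL-safe base64 alphabet).

```python
import re
import uuid

# Hypothetical shape check derived from the format description above.
FINDINGS_STRING_RE = re.compile(
    r"(?P<run_uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r":pydepgate(?P<version>[^:]+)"
    r":(?P<findings_hash>[A-Za-z0-9_-]+)"
)


def looks_like_findings_string(s: str) -> bool:
    """Return True when s has the three-segment shape described above."""
    m = FINDINGS_STRING_RE.fullmatch(s)
    if m is None:
        return False
    # uuid.UUID.version reads the version nibble; a UUIDv7 reports 7.
    return uuid.UUID(m.group("run_uuid")).version == 7
```

A shape check of this kind only rejects obviously malformed input. It says nothing about whether the hash matches an artifact.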


Finding hash

Purpose

The finding hash answers: “Did this version of pydepgate, applied to this artifact, produce this set of findings?” It is stable across:

  • Re-scans of the same artifact on different machines.
  • Re-scans after pydepgate upgrades that do not change detection logic for the affected signals.
  • Scans on different operating systems (path normalization handles separator differences).

It is deliberately unstable across:

  • Any change to the active finding set.
  • pydepgate version changes that alter which signals fire or where they fire. The version string in the findings-string makes this explicit.

Scope

Only findings produced by builtin signal IDs are included. A finding is builtin if and only if its signal.signal_id is present in the default signal catalog at the time the scan runs. Custom rules that promote or reclassify a builtin signal do not affect inclusion. A custom rule that suppresses a builtin finding, however, removes it from the active set and therefore changes the hash. As a result, running with a custom rules file that suppresses any builtin finding produces a different hash than running without one.

Suppressed findings are excluded entirely. Decoded-tree child findings are excluded; the fingerprint covers only Phase C static findings.

Severity is not part of the recipe. Severity is a presentation-layer concern for end consumers. The cite and validate commands operate without severity filters.

Input fields per finding

For each active builtin finding, extract and normalize:

| Field | Source | Normalization |
| --- | --- | --- |
| signal_id | finding.signal.signal_id | Lowercase |
| internal_path | finding.context.internal_path | Forward slashes; strip leading ./ |
| line | finding.signal.location.line | Integer as decimal string; 0 when absent |
| column | finding.signal.location.column | Integer as decimal string; 0 when absent |

No severity, description, context values, rule ID, analyzer name, or matched content is included.

Normalization steps

Path normalization. Replace all backslash characters (\) with forward slashes (/). Strip a leading ./ prefix if present. Do not modify the rest of the path. Apply before sorting.

Per-finding canonical string. Join the four normalized fields with a pipe character:

{signal_id}|{internal_path}|{line}|{column}

Example:

enc001|litellm/proxy/proxy_server.py|142|4

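Sort. Sort the per-finding strings lexicographically, ascending, bytewise over their UTF-8 encoding. This is equivalent to Python’s default str sort, which compares by Unicode code point. The sort is applied over the fully-constructed canonical strings, so line and column values compare as text rather than as numbers (for example, 10 sorts before 9).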

Join. Concatenate the sorted strings with a newline character (\n) as separator. No trailing newline.

Hash. Compute SHA-512 over the joined string encoded as UTF-8. Encode the raw digest bytes as URL-safe base64 with no padding (base64.urlsafe_b64encode(digest).rstrip(b'=').decode('ascii')). This is the <findings_hash> segment of the findings-string.

Empty finding set. When there are no active builtin findings, compute SHA-512 of the empty string and encode as above. This is a valid, unambiguous record that a scan ran and found nothing.
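The steps above can be traced on a small input. The sketch below uses plain field values rather than Finding objects; canonical_line and hash_lines are hypothetical helper names introduced only for this illustration. It shows path normalization, the text-based sort order, and the digest of an empty finding set.

```python
import base64
import hashlib


def canonical_line(signal_id: str, internal_path: str, line, column) -> str:
    """Build one per-finding canonical string from raw field values."""
    path = internal_path.replace("\\", "/")
    if path.startswith("./"):
        path = path[2:]
    # A missing (None) line or column is written as 0.
    return f"{signal_id.lower()}|{path}|{line or 0}|{column or 0}"


def hash_lines(lines) -> str:
    """Sort, join with newlines, SHA-512, URL-safe base64 without padding."""
    joined = "\n".join(sorted(lines))
    digest = hashlib.sha512(joined.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


lines = [
    canonical_line("ENC001", ".\\pkg\\mod.py", 9, None),
    canonical_line("enc001", "./pkg/mod.py", 10, 4),
]
print(sorted(lines))
# ['enc001|pkg/mod.py|10|4', 'enc001|pkg/mod.py|9|0']

print(hash_lines([]))
# z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXcg_SpIdNs6c5H0NE8XYXysP-DGNKHfuwvY7kxvUdBeoGlODJ6-SfaPg
```

Note that the finding on line 10 sorts ahead of the finding on line 9 because the comparison is made on text. Implementations in other languages need to reproduce this ordering rather than sort numerically.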

Python reference implementation

import base64
import hashlib

def compute_finding_hash(findings) -> str:
    """Compute the v1 finding hash.

    findings: iterable of Finding. Suppressed findings and custom-rule
              findings must already be excluded by the caller.

    Returns URL-safe base64 SHA-512 with no padding.
    """
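    parts = []
    for f in findings:
        # Path normalization: backslashes to forward slashes, then drop a leading "./".
        path = f.context.internal_path.replace("\\", "/")
        if path.startswith("./"):
            path = path[2:]
        signal_id = f.signal.signal_id.lower()
        # A missing (None) line or column is recorded as "0".
        line = str(f.signal.location.line) if f.signal.location.line else "0"
        column = str(f.signal.location.column) if f.signal.location.column else "0"
        parts.append(f"{signal_id}|{path}|{line}|{column}")
    # Default str sort compares the full canonical strings by code point.
    parts.sort()
    # Newline separator, no trailing newline; an empty list hashes the empty string.
    joined = "\n".join(parts)
    digest = hashlib.sha512(joined.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")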


def build_findings_string(run_uuid: str, pydepgate_version: str, findings_hash: str) -> str:
    return f"{run_uuid}:pydepgate{pydepgate_version}:{findings_hash}"


def parse_findings_string(s: str) -> tuple[str, str, str]:
    """Parse a findings-string into (run_uuid, pydepgate_version, findings_hash).

    pydepgate_version is returned without the 'pydepgate' prefix.
    Raises ValueError on malformed input.
    """
    parts = s.split(":")
    if len(parts) != 3:
        raise ValueError(f"malformed findings-string: {s!r}")
    run_uuid, version_tag, findings_hash = parts
    if not version_tag.startswith("pydepgate"):
        raise ValueError(f"malformed version segment: {version_tag!r}")
    return run_uuid, version_tag[len("pydepgate"):], findings_hash

Run envelope

Purpose

The run envelope records the context in which a findings-string was produced. It is the information needed to answer: “Which run produced this fingerprint, and what was being scanned?” The envelope is informational; the authoritative fingerprint for distribution is the findings-string.

Fields

| Field | Type | Description |
| --- | --- | --- |
| fingerprint_version | integer | Always 1 for this specification |
| findings_string | string | The complete findings-string as defined above |
| run_id | string | UUID7 of the scan run (the <run_uuid> segment of the findings-string) |
| pydepgate_version | string | Full pydepgate version string |
| timestamp | string | UTC ISO 8601 with timezone offset |
| artifact_identity | string or null | Artifact path or package name as passed to the scanner |
| package_name | string or null | Normalized package name; null for loose-file scans |
| package_version | string or null | Version string from package metadata; null when unavailable |
| artifact_ref | string | Content-addressed artifact reference (see below) |
| total_findings | integer | Count of active builtin findings; suppressed findings excluded |

Artifact reference

The artifact_ref field identifies what was scanned independent of its location on disk.

File-backed artifact (wheel, sdist, loose file – any scan where artifact_sha512 is non-null):

sha512:{artifact_sha512}

Installed-environment scan (artifact_sha512 is null, package coordinates are known):

pkg:{package_name}=={package_version}

The == separator and the lowercase underscore-normalized package name are mandatory regardless of how the user spelled the package name.

Installed-environment scan without package coordinates (both artifact_sha512 and package coordinates are null):

identity:{artifact_identity}

This form is the least stable. The validate command notes when the artifact reference is identity-only and cannot confirm content equivalence.
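The three forms above amount to a fallback chain. The sketch below is illustrative only: build_artifact_ref and normalize_package_name are hypothetical names, and the normalization rule (collapse runs of -, _ and . into one underscore, then lowercase) is an assumption about what "lowercase underscore-normalized" means.

```python
import re
from typing import Optional


def normalize_package_name(name: str) -> str:
    # Assumption: runs of '-', '_' and '.' collapse to a single underscore,
    # and the result is lowercased.
    return re.sub(r"[-_.]+", "_", name).lower()


def build_artifact_ref(
    artifact_sha512: Optional[str],
    package_name: Optional[str],
    package_version: Optional[str],
    artifact_identity: Optional[str],
) -> str:
    """Pick the most stable reference form that the available data allows."""
    if artifact_sha512 is not None:
        return f"sha512:{artifact_sha512}"
    if package_name is not None and package_version is not None:
        return f"pkg:{normalize_package_name(package_name)}=={package_version}"
    # Least stable form: location-dependent identity only.
    return f"identity:{artifact_identity}"
```

Under these assumptions, build_artifact_ref(None, "My-Package", "1.0", None) yields pkg:my_package==1.0.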

JSON object

The envelope is serialized as a compact JSON object. Null fields are serialized as JSON null, not omitted. The JSON object is then base64-encoded (standard alphabet, padding retained, no line wrapping) to produce the armored payload. The example below is pretty-printed for readability.

{
  "fingerprint_version": 1,
  "findings_string": "550e8400-...:pydepgate0.5.0:47DEQpj8...",
  "run_id": "550e8400-e29b-71d4-a716-446655440000",
  "pydepgate_version": "0.5.0",
  "timestamp": "2026-05-29T14:32:01.000000+00:00",
  "artifact_identity": "litellm-1.82.8-py3-none-any.whl",
  "package_name": "litellm",
  "package_version": "1.82.8",
  "artifact_ref": "sha512:3f3f9e40cc...",
  "total_findings": 7
}
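The serialization described above can be sketched with the standard library. This is an illustration rather than the pydepgate implementation: encode_envelope and decode_envelope are hypothetical names, and "compact" is assumed to mean no whitespace between JSON tokens.

```python
import base64
import json


def encode_envelope(envelope: dict) -> str:
    """Serialize an envelope dict to the single-line base64 payload."""
    # Assumption: "compact" means no whitespace between tokens.
    # Python None values are written as JSON null, not omitted.
    compact = json.dumps(envelope, separators=(",", ":"))
    # b64encode uses the standard alphabet, keeps padding, and does not wrap.
    return base64.b64encode(compact.encode("utf-8")).decode("ascii")


def decode_envelope(payload: str) -> dict:
    """Inverse of encode_envelope."""
    return json.loads(base64.b64decode(payload))
```

Round-tripping an envelope through these two functions returns the original dict, including keys whose value is null.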

Armored block format

The armored block is the machine-readable embedding of the run envelope in a scan report. It is what pydepgate validate reads when a findings file is supplied. The findings-string alone is sufficient for validation without a findings file.

-----BEGIN PDG FINGERPRINT-----
{base64_encoded_json}
-----END PDG FINGERPRINT-----
Do not modify the block above. It encodes the run ID, timestamp,
artifact reference, and a hash of the findings from this scan.
Altering it will cause `pydepgate validate` to reject the report.
To inspect the contents, base64-decode the block between the markers.

The base64 payload is a single unbroken line. The begin and end markers are on their own lines. The warning text begins on the line immediately after the end marker with no blank line between them. Color is never applied to the block or the warning text regardless of --color mode, so the block survives copy-paste without embedded escape codes.
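A consumer can recover the envelope from a human-format report by locating the markers and decoding the line between them. The following is a minimal sketch; extract_envelope is a hypothetical helper, and normalizing CRLF line endings before matching is an added assumption for reports that passed through Windows tooling.

```python
import base64
import json
import re

# Markers exactly as shown in the block format above.
_BLOCK_RE = re.compile(
    r"^-----BEGIN PDG FINGERPRINT-----\n"
    r"(?P<payload>[A-Za-z0-9+/=]+)\n"
    r"-----END PDG FINGERPRINT-----$",
    re.MULTILINE,
)


def extract_envelope(report_text: str) -> dict:
    """Find the armored block in report text and return the decoded envelope."""
    text = report_text.replace("\r\n", "\n")
    m = _BLOCK_RE.search(text)
    if m is None:
        raise ValueError("no PDG FINGERPRINT block found")
    # validate=True rejects characters outside the standard base64 alphabet.
    return json.loads(base64.b64decode(m.group("payload"), validate=True))
```

Because the payload is a single unbroken line, the pattern deliberately refuses a payload that spans several lines.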


Embedding by output format

Human format

The armored block appears after the statistics line and the Your Run ID: line that close human output:

{statistics line}
Your Run ID: {run_id}

-----BEGIN PDG FINGERPRINT-----
{base64_payload}
-----END PDG FINGERPRINT-----
Do not modify the block above. It encodes the run ID, timestamp,
artifact reference, and a hash of the findings from this scan.
Altering it will cause `pydepgate validate` to reject the report.
To inspect the contents, base64-decode the block between the markers.

A blank line separates the Your Run ID: line from the begin marker. The block and its warning are the last content written to stdout. The armored block is only emitted when --save-to-db is active.

JSON format

The run envelope JSON object is embedded at the top level as a validation key. The base64 encoding is not used in JSON output because the format is already machine-readable. Schema version 4 introduces this field; the schema_version field is bumped from 3 to 4 in the release that implements this specification.

{
  "report_type": "pydepgate_scan_result",
  "schema_version": 4,
  "artifact": { ... },
  "findings": [ ... ],
  "validation": {
    "fingerprint_version": 1,
    "findings_string": "550e8400-...:pydepgate0.5.0:47DEQpj8...",
    "run_id": "550e8400-e29b-71d4-a716-446655440000",
    "pydepgate_version": "0.5.0",
    "timestamp": "2026-05-29T14:32:01.000000+00:00",
    "artifact_identity": "litellm-1.82.8-py3-none-any.whl",
    "package_name": "litellm",
    "package_version": "1.82.8",
    "artifact_ref": "sha512:3f3f9e40cc...",
    "total_findings": 7
  }
}

The validation key is present only when --save-to-db is active. Consumers built against schema version 3 continue to work unchanged because the new key is additive and --save-to-db is opt-in.

SARIF format

The run envelope is embedded in automationDetails.properties.pdgFingerprint as the base64-encoded armored payload (the same string that appears between the markers in human output, without the markers themselves):

"automationDetails": {
  "id": "pydepgate/{scan_mode}/{run_id}",
  "properties": {
    "pdgFingerprint": "{base64_payload}"
  }
}

Consumers base64-decode the value and parse the resulting JSON to inspect the envelope. The SARIF embedding is only present when --save-to-db is active.
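Decoding the SARIF embedding might look like the sketch below. It assumes the caller has already parsed the SARIF file and selected one entry of its runs array, which is where automationDetails lives in SARIF 2.1.0; envelope_from_sarif_run is a hypothetical name.

```python
import base64
import json
from typing import Optional


def envelope_from_sarif_run(run: dict) -> Optional[dict]:
    """Return the decoded run envelope from one SARIF run object, if present."""
    props = run.get("automationDetails", {}).get("properties", {})
    payload = props.get("pdgFingerprint")
    if payload is None:
        # The scan ran without --save-to-db, so no fingerprint was embedded.
        return None
    return json.loads(base64.b64decode(payload))
```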


pydepgate validate command (v0.6.0 target)

validate is not implemented in v0.5.0. This section specifies its behavior for implementation reference.

pydepgate validate <artifact> <findings-string> [--findings-file PATH]

  • <artifact>: path to the artifact to scan. Accepts the same targets as pydepgate scan.
  • <findings-string>: the complete findings-string in uuid:pydepgateX.X.X:hash format. Generated by pydepgate cite (v0.7.0).
  • --findings-file PATH: optional. Path to a pydepgate report containing a PDG_FINGERPRINT block. When supplied, the run envelope is extracted and used to enrich the validation report. The findings-string in the block must match the <findings-string> argument or the command exits with code 2.

Behavior

  1. Parse the findings-string to extract run_uuid, pydepgate_version, and findings_hash.

  2. Check the pydepgate version in the findings-string against the running version. If they differ, emit a warning and continue. A version mismatch is noted as the likely cause if the hash does not match.

  3. Set the current run UUID to run_uuid from the findings-string for the duration of this invocation. Validation runs are not saved to the evidence database: they adopt an existing run ID and writing to the DB would corrupt the run record for the original scan.

  4. Scan the artifact using the same static analysis pass as pydepgate scan. --save-to-db has no effect during validation, regardless of global flags.

  5. Compute the finding hash using the recipe above.

  6. Compare the computed hash against findings_hash from the findings-string.

  7. Report the result.

Exit codes

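| Code | Meaning |
| --- | --- |
| 0 | Hash matches. Artifact reproduces the claimed findings. |
| 1 | Hash does not match. Findings differ. |
| 2 | Version mismatch and hash does not match (version mismatch flagged as likely cause). Also returned when the findings-string in a --findings-file block does not match the <findings-string> argument. |
| 3 | Scan error (artifact not found, unreadable, etc.). |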
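The comparison in steps 2 and 6 and the first three exit codes can be expressed as a small decision function. This is a sketch for implementation reference only. validation_exit_code is a hypothetical name, and code 3 is left to the caller because it arises from a failed scan rather than from a comparison.

```python
def validation_exit_code(
    claimed_version: str,
    running_version: str,
    claimed_hash: str,
    computed_hash: str,
) -> int:
    """Map a completed validation scan to exit code 0, 1 or 2."""
    if computed_hash == claimed_hash:
        # A version mismatch alone only produces a warning.
        return 0
    if claimed_version != running_version:
        # Hash differs and versions differ: the version is the likely cause.
        return 2
    return 1
```

A matching hash returns 0 even when the versions differ, which reflects the "emit a warning and continue" rule in step 2.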

Versioning

This document describes fingerprint version 1. The fingerprint_version field in the envelope exists to allow future incompatible changes without breaking parsers that need to support both old and new reports.

A fingerprint version increment is required when:

  • The set of per-finding input fields changes.
  • The normalization steps change in a way that produces a different hash for the same findings.
  • The envelope field set changes incompatibly (field removed or renamed).

A fingerprint version increment is not required for:

  • Additive new envelope fields (old parsers ignore unknown keys).
  • Changes to output format placement that do not affect hash computation.
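A parser that follows these versioning rules checks fingerprint_version first and tolerates keys it does not know. The sketch below is one possible shape: read_envelope, SUPPORTED_FINGERPRINT_VERSIONS, and V1_FIELDS are hypothetical names, and the field list is taken from the envelope table in this document.

```python
SUPPORTED_FINGERPRINT_VERSIONS = {1}

# Field names from the v1 envelope table.
V1_FIELDS = (
    "fingerprint_version", "findings_string", "run_id", "pydepgate_version",
    "timestamp", "artifact_identity", "package_name", "package_version",
    "artifact_ref", "total_findings",
)


def read_envelope(obj: dict) -> dict:
    """Validate a decoded envelope and return only the fields this parser knows."""
    version = obj.get("fingerprint_version")
    if version not in SUPPORTED_FINGERPRINT_VERSIONS:
        raise ValueError(f"unsupported fingerprint_version: {version!r}")
    missing = [k for k in V1_FIELDS if k not in obj]
    if missing:
        raise ValueError(f"envelope is missing fields: {missing}")
    # Unknown keys are ignored, so additive fields do not break this parser.
    return {k: obj[k] for k in V1_FIELDS}
```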

Non-goals

The PDG fingerprint is not a cryptographic signature. It does not use asymmetric keys and cannot prove authorship, only consistency. It is not a tamper-evident seal for the full report – it covers only the normalized finding set, not descriptions, statistics, or diagnostics. It is not a replacement for SARIF partial fingerprints, which remain the mechanism for per-finding deduplication across runs in GitHub Code Scanning and similar consumers.