Personally Identifiable Information (PII)
Personally Identifiable Information (PII) is any data that can identify a person directly or indirectly. Learn how PII differs from GDPR personal data and why it matters for AI.
Direct and indirect identifiers
Not all PII identifies a person the same way. Practitioners distinguish two categories, and the distinction matters because indirect identifiers are far easier to overlook.
Direct identifiers
A direct identifier names a specific person without any additional context. A passport number, a national insurance or Social Security number, a full legal name paired with a date of birth, a personal email address, or a biometric template each points to one individual on its own. These are the patterns most data controls are tuned to catch.
Indirect identifiers (quasi-identifiers)
An indirect identifier does not single out a person by itself, but becomes identifying when combined with other data. A postal code, a job title, an employer, a gender, and a birth year are each unremarkable in isolation. Together, they can narrow a population to one person. Re-identification research has repeatedly shown that a small number of quasi-identifiers is often enough to uniquely identify individuals in supposedly anonymized datasets. A frequently cited example is Latanya Sweeney's estimate, based on 1990 U.S. census data, that about 87% of the U.S. population could be uniquely identified by five-digit ZIP code, gender, and date of birth. This is why removing names alone does not make data non-personal.
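The combination effect can be checked directly on a dataset. The following is a minimal Python sketch, assuming a small made-up set of records with names already removed; the field names and the `unique_records` helper are illustrative, not part of any particular tool. It counts how many records become unique once several quasi-identifiers are considered together.

```python
from collections import Counter

# Hypothetical records with names already removed; field names are illustrative.
records = [
    {"postal_code": "90210", "gender": "F", "birth_year": 1984, "job_title": "Nurse"},
    {"postal_code": "90210", "gender": "F", "birth_year": 1984, "job_title": "Teacher"},
    {"postal_code": "90210", "gender": "M", "birth_year": 1990, "job_title": "Engineer"},
    {"postal_code": "10001", "gender": "F", "birth_year": 1975, "job_title": "Lawyer"},
]

def unique_records(rows, quasi_identifiers):
    """Return the rows whose combination of quasi-identifier values occurs only once."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)
    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] == 1]

# A single attribute singles out few records ...
print(len(unique_records(records, ["gender"])))
# ... while a combination of attributes singles out more of them.
print(len(unique_records(records, ["postal_code", "gender", "birth_year"])))
```

A record that is unique on its quasi-identifiers can be linked to a named person by anyone who holds another dataset with the same attributes, which is the basic idea behind k-anonymity style checks.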
PII versus related categories
"PII" is a term rooted in U.S. privacy practice. European data protection law uses the broader concept of "personal data," and most frameworks carve out a more tightly regulated subset for especially sensitive attributes. These categories overlap but are not interchangeable, and conflating them leads to under-protection.
| | PII (U.S. usage) | Personal data (GDPR) | Sensitive / special-category data |
|---|---|---|---|
| Scope | Data that identifies a specific individual | Any data relating to an identified or identifiable person | A defined subset warranting heightened protection |
| Breadth | Narrower; focused on identifying attributes | Broader; includes data merely relating to a person | Narrowest; an enumerated list of categories |
| Typical examples | Name, SSN, passport number, email address | The above plus IP addresses, device IDs, location, online identifiers | Health, biometric, genetic, racial or ethnic, religious, sexual-orientation data |
| Indirect data | Often treated as PII when combined | Explicitly covered when a person is identifiable | Covered, with stricter conditions for processing |
The practical takeaway: GDPR's "personal data" is wider than the classic notion of PII — an IP address or a device identifier may be personal data even if it would not be considered PII under a narrow U.S. reading. Sensitive or special-category data (health, biometrics, race, religion, sexual orientation, and similar) is a smaller set that nearly every framework subjects to stricter handling. When in doubt, treat the broadest applicable definition as the operative one.
Why PII matters for AI
AI tools have created a new, high-volume path for PII to leave an organization — one that most data controls were never positioned to watch. Four exposure patterns dominate.
Users pasting PII into prompts
The most common exposure is also the most mundane: an employee pastes a customer list, a support transcript, a CV, or a contract into a chatbot to summarize or rewrite it. The PII is now prompt text submitted to a third-party model, outside the channels that traditional data loss prevention inspects.
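As a rough illustration of what inspecting a pasted prompt could look like, here is a minimal Python sketch that redacts two structured identifiers before the text is sent anywhere. The patterns, the `redact_prompt` function, and the placeholder labels are assumptions made for this example; a production detector would need far broader coverage and validation.

```python
import re

# Illustrative patterns only; real detectors need broader coverage and validation.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_prompt(prompt):
    """Replace each match with a typed placeholder and report what was found."""
    findings = []
    for label, pattern in PATTERNS.items():
        prompt, count = pattern.subn(f"[{label}]", prompt)
        if count:
            findings.append((label, count))
    return prompt, findings

text = "Summarize: Jane Roe, jane.roe@example.com, SSN 123-45-6789, wants a refund."
redacted, findings = redact_prompt(text)
print(redacted)   # the email and SSN are replaced; the name is not
print(findings)
```

Note that the customer's name passes through untouched: pattern matching handles structured identifiers well but does not recognize names or other free-form personal details.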
Model memorization and training
When prompts are used to train or fine-tune a model, PII contained in them can be retained in the model's parameters and, under some conditions, resurfaced later. Even where a provider states that inputs are not used for training, an organization that cannot inspect its own outbound prompts cannot verify what PII it has exposed or to whom.
Completions surfacing PII from connected systems
As AI assistants and agents are wired into internal systems — CRMs, ticketing tools, knowledge bases, databases — model completions can return PII drawn from those sources to a user who should not see it, or echo it into a downstream tool call. The sensitive data leaves not through an upload but through the model's output.
Regulatory exposure
PII is the object most privacy regulation is built to protect. Mishandling it through AI tools can implicate frameworks such as the GDPR in the EU, the CCPA/CPRA in California, and HIPAA for health information in the United States — among others. Obligations commonly include lawful basis for processing, data minimization, purpose limitation, and breach notification. Uninspected PII flowing into external models undermines an organization's ability to demonstrate any of these.
Governing PII across AI surfaces
Protecting PII in the AI era means adding inspection at the layer where AI activity actually happens, rather than relying solely on network egress or endpoint controls:
- Prompt inspection — PII is detected in the prompt before submission to any model, and can be redacted or blocked on policy match.
- Completion inspection — model outputs are checked before they reach the user, catching PII surfaced from connected systems.
- Agent tool-call governance — arguments passed to external tools and APIs by autonomous agents are inspected before execution.
- Semantic detection — beyond regex for structured identifiers, semantic analysis helps catch indirect identifiers and PII embedded in free-form natural language.
- Audit trail — every inspected interaction is recorded, so an organization can demonstrate what PII was handled, where, and under what policy.
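To make the tool-call and audit-trail items above more concrete, the following Python sketch walks the nested arguments of a hypothetical agent tool call, flags any string containing an email address, and returns an audit record with the decision. The tool name, the record fields, and the `find_pii` and `govern_tool_call` helpers are all assumptions for illustration, and the single email pattern stands in for a fuller detector.

```python
import json
import re
from datetime import datetime, timezone

# One illustrative pattern; a real deployment would plug in a fuller detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_pii(value, path="args"):
    """Walk nested arguments and yield the path of each string containing an email."""
    if isinstance(value, dict):
        for key, item in value.items():
            yield from find_pii(item, f"{path}.{key}")
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from find_pii(item, f"{path}[{index}]")
    elif isinstance(value, str) and EMAIL.search(value):
        yield path

def govern_tool_call(tool_name, args, policy="block"):
    """Apply the policy if PII is found and return an audit record either way."""
    hits = list(find_pii(args))
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool_name,
        "pii_paths": hits,  # where PII was found, not the PII itself
        "decision": policy if hits else "allow",
    }

record = govern_tool_call(
    "send_webhook",  # hypothetical tool name
    {"url": "https://example.com/hook", "body": {"notes": ["contact jane.roe@example.com"]}},
)
print(json.dumps(record, indent=2))
```

Recording the location of a finding rather than its value keeps the audit trail from becoming a second copy of the PII it is meant to account for.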
Questions a PII governance capability answers
- Is PII being pasted into external AI tools? — Prompt-level detection with redaction or block.
- Did a model return personal data from a connected source? — Completion inspection before display.
- What PII did this AI agent send to an external API? — Tool-call argument inspection and audit.
- Which users and tools handle the most PII? — Usage analytics across AI surfaces.
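The last question is essentially an aggregation over audit records. A minimal Python sketch, assuming a hypothetical log in which each entry carries a user, a tool, and a count of PII findings (the field names and the `pii_totals` helper are invented for this example):

```python
from collections import Counter

# Hypothetical audit records; the field names are assumptions for illustration.
audit_log = [
    {"user": "alice", "tool": "chat-assistant", "pii_findings": 3},
    {"user": "bob", "tool": "chat-assistant", "pii_findings": 1},
    {"user": "alice", "tool": "code-assistant", "pii_findings": 2},
    {"user": "carol", "tool": "chat-assistant", "pii_findings": 0},
]

def pii_totals(log, field):
    """Sum PII findings per value of `field`, sorted from most to least."""
    totals = Counter()
    for entry in log:
        totals[entry[field]] += entry["pii_findings"]
    return totals.most_common()

print(pii_totals(audit_log, "user"))
print(pii_totals(audit_log, "tool"))
```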