PIICatcher

Detect and tag PII across databases and data warehouses

PIICatcher scans databases and data warehouses for PII/PHI using regex and NLP, supports incremental scans and plugins, and integrates with Datahub and Amundsen for automated tagging.

Overview

PIICatcher is a command‑line and Docker‑based scanner that discovers personally identifiable information (PII) and protected health information (PHI) in relational databases and modern data warehouses. It applies regular‑expression matching to column names and natural‑language processing (via a spaCy plugin) to sampled data, then classifies columns into PII types such as email, address, and gender.
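
To make the two detection passes concrete, here is a minimal sketch in plain Python. It is not PIICatcher's code: the patterns, function names, and the 0.5 threshold are assumptions chosen for illustration, and the sample-data check uses a simple regex as a stand-in for the spaCy-based NLP step.

```python
import re

# Hypothetical column-name patterns; PIICatcher's real rules may differ.
COLUMN_NAME_PATTERNS = {
    "EMAIL": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "ADDRESS": re.compile(r"address|street|zip|postal", re.IGNORECASE),
    "GENDER": re.compile(r"gender|sex", re.IGNORECASE),
}

# Stand-in for the NLP pass: a regex that recognizes email-like values.
EMAIL_VALUE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")


def classify_column_name(name):
    """Return the first PII type whose pattern matches the column name."""
    for pii_type, pattern in COLUMN_NAME_PATTERNS.items():
        if pattern.search(name):
            return pii_type
    return None


def classify_samples(values, threshold=0.5):
    """Return "EMAIL" if enough non-empty sample values look like emails."""
    values = [v for v in values if v]
    if not values:
        return None
    hits = sum(1 for v in values if EMAIL_VALUE.match(v))
    return "EMAIL" if hits / len(values) >= threshold else None
```

Name matching is cheap and needs no data access, so it runs first in this sketch; sampling values catches columns whose names give nothing away, at the cost of reading rows.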

Capabilities & Deployment

The tool supports incremental scans, allowing you to schedule recurring jobs that only evaluate new or unscanned columns, and offers flexible include/exclude filters for schemas and tables to control compute usage. Results can be exported directly to data catalogs like Datahub and Amundsen, where columns and tables are automatically tagged with the detected PII types. Extensibility is built‑in: developers can create custom metadata or datum detectors by subclassing provided base classes and registering them via Python entry points.
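
The include/exclude filtering described above can be sketched as follows. This is an illustration only: the function name `select_tables` and the use of full-match regular expressions are assumptions, not PIICatcher's actual option syntax.

```python
import re


def select_tables(tables, include=None, exclude=None):
    """Keep names that match any include pattern and no exclude pattern.

    An empty or missing include list means "include everything".
    """
    selected = []
    for name in tables:
        if include and not any(re.fullmatch(p, name) for p in include):
            continue
        if exclude and any(re.fullmatch(p, name) for p in exclude):
            continue
        selected.append(name)
    return selected


tables = ["public.users", "public.orders", "staging.tmp_users"]
to_scan = select_tables(
    tables,
    include=[r"public\..*"],
    exclude=[r".*\.orders"],
)
```

Filtering before any query is issued is what keeps compute low: excluded tables are never sampled.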

Getting Started

Deploy PIICatcher with a single Docker alias or install it from PyPI. A lightweight SQLite catalog stores scan state by default, while production environments can point to a persistent catalog backend. The project ships with plugins for major databases—including SQLite, MySQL, PostgreSQL, Redshift, Athena, Snowflake, and BigQuery—making it suitable for heterogeneous data landscapes.

Highlights

Regex‑based column name detection and NLP data sample analysis

Incremental scanning with schema/table include‑exclude filters

Plugin architecture for custom detectors (e.g., spaCy integration)

Direct ingestion into Datahub and Amundsen for automated PII tagging

Pros

Supports a wide range of databases and data warehouses
Extensible via plugins and custom detector classes
Incremental scans reduce compute cost for recurring jobs
Seamless integration with popular data catalog platforms

Considerations

Requires a Python environment and familiarity with the CLI or Docker
Plugin ecosystem is still growing; some detectors may be missing
No graphical user interface; operates via command line
Performance depends on data size and complexity of NLP models

Managed products teams compare with

When teams consider PIICatcher, these hosted platforms usually appear on the same shortlist.

Amazon Macie

Managed sensitive data discovery and protection for Amazon S3.

BigID

Data intelligence platform focused on data privacy, security, and governance through sensitive data discovery and classification.

OneTrust

Unified trust platform for privacy, consent, data governance, and compliance automation.

Fit guide

Great for

Data engineering teams needing automated PII discovery across multiple sources
Organizations that already use Datahub or Amundsen for metadata management
Environments where Docker or Python deployment is standard
Teams comfortable extending functionality with custom Python plugins

Not ideal for

Real‑time streaming pipelines that require instant PII detection
Users seeking a full‑featured GUI for data classification
Small projects with only a few tables where the tool adds overhead
Organizations that need built‑in data masking or anonymization

How teams use it

Compliance audit of a data warehouse

Identify and catalog all PII locations to satisfy GDPR/CCPA requirements

Enriching a data catalog with privacy metadata

Automatically tag columns in Datahub or Amundsen, making PII searchable for downstream users

Scheduled incremental monitoring

Run nightly scans on newly added tables, maintaining continuous compliance with minimal compute

Custom domain‑specific PII detection

Develop and register a spaCy‑based plugin to detect industry‑specific identifiers

Tech snapshot

Python: 97%
Dockerfile: 1%
Shell: 1%

Frequently asked questions

Which databases does PIICatcher support?

SQLite, MySQL, PostgreSQL, Amazon Redshift, Amazon Athena, Snowflake, and Google BigQuery.

How does incremental scanning work?

PIICatcher records scanned columns in a catalog and only processes columns that are new or have not been scanned before, respecting include/exclude filters.
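
The idea can be sketched with Python's built-in `sqlite3` module. The table layout, function names, and the `detect` callback are hypothetical and do not reflect PIICatcher's actual catalog schema.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) a tiny catalog that remembers scanned columns."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scanned_columns ("
        "table_name TEXT, column_name TEXT, pii_type TEXT, "
        "PRIMARY KEY (table_name, column_name))"
    )
    return conn


def incremental_scan(conn, columns, detect):
    """Scan only columns absent from the catalog; return those scanned."""
    newly_scanned = []
    for table, column in columns:
        seen = conn.execute(
            "SELECT 1 FROM scanned_columns "
            "WHERE table_name = ? AND column_name = ?",
            (table, column),
        ).fetchone()
        if seen:
            continue  # already recorded by an earlier run
        conn.execute(
            "INSERT INTO scanned_columns VALUES (?, ?, ?)",
            (table, column, detect(column)),
        )
        newly_scanned.append((table, column))
    conn.commit()
    return newly_scanned
```

On a second run over the same source, only columns added since the first run reach `detect`, which is why recurring jobs stay cheap.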

Can PIICatcher integrate with my existing data catalog?

Yes, ingestion functions are provided for Datahub and Amundsen to automatically apply PII tags to tables and columns.

Do I need to run a separate server for PIICatcher?

No, it runs as a CLI tool or Docker container. State is stored in a lightweight SQLite catalog by default.

How can I add custom PII detectors?

Create a class inheriting from MetadataDetector or DatumDetector, implement a detect method returning a PIIType, and register it via Python entry points or the API.
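
The shape of such a detector might look like the sketch below. To keep it runnable without PIICatcher installed, it defines local stand-ins for the base class and the PII type; the real MetadataDetector, DatumDetector, and PIIType signatures may differ, and the `EMP-` identifier format is invented for illustration. Entry-point registration is not shown because the source does not name the entry-point group.

```python
import re
from enum import Enum
from typing import Optional


class PiiTypeStandIn(Enum):
    """Local stand-in for the library's PII type (assumption)."""
    EMPLOYEE_ID = "employee_id"


class DatumDetectorStandIn:
    """Local stand-in for a value-level detector base class (assumption)."""

    def detect(self, value: str) -> Optional[PiiTypeStandIn]:
        raise NotImplementedError


class EmployeeIdDetector(DatumDetectorStandIn):
    """Flags values shaped like a hypothetical employee ID, e.g. EMP-123456."""

    PATTERN = re.compile(r"^EMP-\d{6}$")

    def detect(self, value: str) -> Optional[PiiTypeStandIn]:
        if self.PATTERN.match(value):
            return PiiTypeStandIn.EMPLOYEE_ID
        return None
```

A metadata-level detector would follow the same pattern but inspect a column's name rather than a sampled value.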

Project at a glance

Status: Dormant

Stars: 338
Watchers: 338
Forks: 99

License: Apache-2.0
Repo age: 6 years old
Last commit: 2 years ago
Primary language: Python

